## Imbalanced Classification Task

### Scenario
You are working as an analyst at an internet service provider. You are given historical data about the company's customers and their churn. Your task is to build a machine learning model that helps the company identify the customers who are most likely to churn, so that it can act to prevent the resulting losses.

#### Round 1
    1. Import the libraries and modules you will need.
    2. Read the data into Python and call the dataframe churnData.
    3. Check the datatypes of all the columns in the data. You will see that the column TotalCharges has type object. Convert this column to a numeric type using the pd.to_numeric function.
    4. Check for null values in the dataframe. Replace the null values.
    5. Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges.
    6. Split the data into a training set and a test set.
    7. Scale the features using either MinMaxScaler or StandardScaler.
    (Optional) Encode the categorical variables so you can use them for modeling later.

##### 1.

In [107]:
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split,KFold,cross_val_score,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score
from sklearn.ensemble import RandomForestClassifier

##### 2.

In [40]:
data = pd.read_csv("DATA_Customer-Churn.txt")
data.head(100)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Female,0,No,No,12,Yes,Yes,No,No,No,No,No,Month-to-month,78.95,927.35,Yes
96,Male,0,Yes,Yes,71,Yes,Yes,Yes,No,Yes,No,No,One year,66.85,4748.7,No
97,Male,0,No,No,5,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,21.05,113.85,Yes
98,Male,0,No,No,52,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,21.00,1107.2,No


##### 3. 

In [41]:
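# Rebuild the total as tenure * MonthlyCharges instead of converting TotalCharges.
# This is only an approximation: row 1 gives 34 * 56.95 = 1936.3 versus the recorded 1889.5.
data['Total_Charges'] = data['tenure'] * data['MonthlyCharges']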

In [46]:
data.info()
#data.drop('TotalCharges',axis = 1, inplace = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   OnlineSecurity    7043 non-null   object 
 7   OnlineBackup      7043 non-null   object 
 8   DeviceProtection  7043 non-null   object 
 9   TechSupport       7043 non-null   object 
 10  StreamingTV       7043 non-null   object 
 11  StreamingMovies   7043 non-null   object 
 12  Contract          7043 non-null   object 
 13  MonthlyCharges    7043 non-null   float64
 14  Churn             7043 non-null   object 
 15  Total_Charges     7043 non-null   float64
dtypes: float64(2), int64(2), object(12)
memory
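
The cell above sidesteps the conversion by recomputing the total from tenure and MonthlyCharges. For comparison, here is a minimal sketch of the `pd.to_numeric` conversion the task describes, run on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset). It assumes the object dtype comes from entries that cannot be parsed as numbers, such as blank strings.

```python
import pandas as pd

# Toy frame standing in for the churn data; the values are made up for illustration.
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" turns anything that cannot be parsed (here the blank string) into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = toy["TotalCharges"].isna().sum()
print(toy.dtypes)
print("missing after conversion:", n_missing)
```

Any NaN produced this way would then have to be handled in the null-value step.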

##### 4.

In [48]:
data.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
MonthlyCharges      0
Churn               0
Total_Charges       0
dtype: int64
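
No nulls show up here because Total_Charges was computed from two complete columns. Had TotalCharges been converted with `pd.to_numeric(..., errors='coerce')` instead, unparseable entries would appear as NaN and need replacing. Below is a sketch of one common choice, filling with the median, on a made-up frame (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the numbers are illustrative only.
toy = pd.DataFrame({"TotalCharges": [29.85, 1889.5, np.nan, 108.15]})

median_value = toy["TotalCharges"].median()  # NaN is skipped by default
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)

print(toy["TotalCharges"].isna().sum())
```

To keep information from the test set out of training, the fill value would ideally be computed on the training rows only, after the split.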

##### 5.

In [55]:
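# Select the four numeric features; a list (rather than a tuple) is the unambiguous way to pick several columns
X = data.loc[:, ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'Total_Charges']]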
X.head()

Unnamed: 0,tenure,SeniorCitizen,MonthlyCharges,Total_Charges
0,1,0,29.85,29.85
1,34,0,56.95,1936.3
2,2,0,53.85,107.7
3,45,0,42.3,1903.5
4,2,0,70.7,141.4


##### 6.

In [80]:
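y = data.loc[:, 'Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Encode the target as 1 (churned) / 0 (stayed) so that precision and recall treat churn as the positive class
y_train = y_train.apply(lambda x: 1 if x == 'Yes' else 0)
y_test = y_test.apply(lambda x: 1 if x == 'Yes' else 0)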
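
Because the task is about class imbalance, it can help to check the class proportions before splitting and to keep them equal in both sets. The sketch below uses made-up data (`X_toy`, `y_toy` are hypothetical) and the `stratify` argument of `train_test_split`; the split above does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 4 churners out of 20 customers (20% positive class), made up for illustration.
X_toy = pd.DataFrame({"tenure": range(20)})
y_toy = pd.Series(["Yes"] * 4 + ["No"] * 16)

# Share of each class in the full data
print(y_toy.value_counts(normalize=True))

# stratify=y_toy keeps the Yes/No ratio the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print((y_tr == "Yes").sum(), "churners in train,", (y_te == "Yes").sum(), "in test")
```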

##### 7.

In [84]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
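
The optional encoding step is not shown above. One possible approach is one-hot encoding with `pd.get_dummies`, sketched here on a made-up frame that borrows two column names from the dataset (the rows are illustrative).

```python
import pandas as pd

# Toy frame with two of the categorical columns; the rows are made up for illustration.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# drop_first=True drops one level per column to avoid perfectly collinear dummies
encoded = pd.get_dummies(toy, columns=["Partner", "Contract"], drop_first=True)

print(list(encoded.columns))
```

With `drop_first=True`, the alphabetically first level of each column ("No" and "Month-to-month" here) becomes the implicit baseline.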

#### Round 2
    (Optional) Fit a Logistic Regression model on the training data.
    1. Fit a KNN classifier (not a KNN regressor) on the training data.
    2. Fit a Decision Tree classifier on the training data.
    3. Compare the accuracy, precision and recall of these models on both the train and test sets.

##### Optional

In [85]:
logr = LogisticRegression()
logr.fit(X_train,y_train)

##### 1.

In [86]:
knc = KNeighborsClassifier()
knc.fit(X_train,y_train)

##### 2.

In [87]:
from sklearn.tree import DecisionTreeClassifier
dct = DecisionTreeClassifier()
dct.fit(X_train,y_train)

##### 3.

In [97]:
models = [logr,knc,dct]
def evaluate(model):
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    accuracy_train = accuracy_score(y_train,y_train_pred) 
    precision_train = precision_score(y_train,y_train_pred)
    recall_train = recall_score(y_train,y_train_pred)
    accuracy_test = accuracy_score(y_test,y_test_pred) 
    precision_test = precision_score(y_test,y_test_pred)
    recall_test = recall_score(y_test,y_test_pred)
    
    return (accuracy_train,precision_train,recall_train,accuracy_test,precision_test,recall_test)
scores = {model:evaluate(model) for model in models} 
scores = pd.DataFrame.from_dict(scores,orient='index')
cols = ['accuracy_train','precision_train','recall_train','accuracy_test','precision_test','recall_test']
scores.columns = cols
scores

Unnamed: 0,accuracy_train,precision_train,recall_train,accuracy_test,precision_test,recall_test
LogisticRegression(),0.788427,0.648438,0.44385,0.806955,0.698039,0.477212
KNeighborsClassifier(),0.829428,0.72612,0.574198,0.772179,0.585526,0.477212
DecisionTreeClassifier(),0.983848,0.99023,0.948529,0.726757,0.484375,0.49866
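
In the table above, the decision tree scores about 0.98 accuracy on the training set but about 0.73 on the test set, which points to overfitting. Test recall stays below 0.50 for all three models, so more than half of the churners are missed. The sketch below shows how precision and recall follow from the confusion matrix, using hand-made labels (`y_true`, `y_pred` are illustrative, not model output).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels: 1 = churned, 0 = stayed (illustrative only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model caught

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```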


#### Round 3
    Apply K-fold cross-validation to the models built above and check each model's score.

In [106]:
def cross_score(model):
    result = cross_val_score(model, X_train, y_train, cv=5,scoring='accuracy')
    return result
cross_scores = {model:cross_score(model) for model in models} 
cross_scores = pd.DataFrame.from_dict(cross_scores,orient='index')
cross_scores

Unnamed: 0,0,1,2,3,4
LogisticRegression(),0.801242,0.789707,0.771961,0.797693,0.778863
KNeighborsClassifier(),0.770186,0.767524,0.758651,0.783496,0.761989
DecisionTreeClassifier(),0.73913,0.717835,0.724933,0.715173,0.721137
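
Two details of the cell above are worth noting. Passing `cv=5` with a classifier makes scikit-learn use stratified folds without shuffling, and `X_train` was scaled before cross-validation, so the scaler had already seen each fold's validation rows. The sketch below is one possible alternative, not the notebook's method: an explicit `KFold` and a pipeline that refits the scaler inside every fold. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the four churn features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=42,
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf, scoring="accuracy")

print("mean accuracy:", fold_scores.mean(), "std:", fold_scores.std())
```

Reporting the mean and standard deviation across folds makes the three models easier to compare than five separate columns.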


#### Round 4
    1. Fit a Random Forest classifier on the data and compare the accuracy.
    2. Tune the hyperparameters with GridSearchCV and check the results.
    3. Retrain the final model with the best parameters found.
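
The three steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, with a small hypothetical parameter grid chosen for speed; it is not the grid or the result of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled churn features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical grid; a real search would usually cover more values
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [5, 20]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X_tr, y_tr)
print("best parameters:", search.best_params_)

# Retrain a final model with the best parameters found
final_model = RandomForestClassifier(n_estimators=25, random_state=42, **search.best_params_)
final_model.fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

With the default `refit=True`, `search.best_estimator_` is already refitted on the whole training set, so the explicit retraining is shown only to mirror step 3.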

##### 1.

In [109]:
rfc_ops = {"max_depth":6,
           "min_samples_leaf":20,
           "max_features":None,
           "n_estimators":100,
           "bootstrap":True,
           "oob_score":True,
           "random_state":42}

rfc = RandomForestClassifier(**rfc_ops)
rfc.fit(X_train,y_train)

In [112]:
pd.DataFrame(zip(cols,evaluate(rfc)))

Unnamed: 0,0,1
0,accuracy_train,0.805999
1,precision_train,0.704985
2,recall_train,0.463235
3,accuracy_test,0.804116
4,precision_test,0.699588
5,recall_test,0.455764
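
The forest above is created with `oob_score=True`, but the out-of-bag estimate is never read. As a sketch on synthetic data (not the churn data), the fitted attribute `oob_score_` gives an accuracy estimate computed from the samples each tree did not see during bootstrapping.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_leaf=20,
    bootstrap=True, oob_score=True, random_state=42,
)
forest.fit(X_demo, y_demo)

# Accuracy estimated on out-of-bag samples; no separate validation set is needed
oob_accuracy = forest.oob_score_
print("OOB accuracy:", oob_accuracy)
```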


##### 2.