# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
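
The metrics named in the list above can be worked out by hand for the five new rows. This is a minimal, dependency-free sketch: only `y_true` comes from the assignment text, while the probabilities in `probs` and the 0.5 cutoff are made up for illustration.

```python
# True labels from the assignment text; the probabilities are hypothetical.
y_true = [1, 0, 0, 1, 0]
probs = [0.80, 0.30, 0.10, 0.45, 0.60]  # assumed P(churn) for each row
threshold = 0.5
y_pred = [int(p >= threshold) for p in probs]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

# AUC: share of (churner, non-churner) pairs in which the churner gets
# the higher probability; ties count as half.
pos = [p for t, p in zip(y_true, probs) if t == 1]
neg = [p for t, p in zip(y_true, probs) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} auc={auc:.3f}")
```

Accuracy, precision, and recall depend on the chosen cutoff, whereas AUC only depends on how the rows are ranked, which is one reason it is a common choice when the classes are imbalanced.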

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
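
For the optional file-input challenge, a command-line entry point could look roughly like the sketch below. It uses the standard library's `argparse` rather than `click` or `input()`; the function name `parse_args` and the `--threshold` option are hypothetical and not part of the assignment.

```python
import argparse


def parse_args(argv=None):
    """Parse the CSV path and an optional threshold from the command line."""
    parser = argparse.ArgumentParser(description="Predict churn for a CSV file.")
    parser.add_argument("csv_path", help="path to a CSV file of customer rows")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.5,
        help="probability cutoff for labelling a row as churn",
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


# Example: parse an explicit argument list instead of the real command line.
args = parse_args(["new_churn_data.csv", "--threshold", "0.7"])
print(args.csv_path, args.threshold)
```

In a script, the parsed `csv_path` would then be read with `pd.read_csv` and passed to the prediction function.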

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/modified_churn1_1.csv')
df.tail()

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,monthly_charges_tenure
7038,24,1,1,3,84.8,1990.5,0,2035.2
7039,72,1,1,1,103.2,7362.9,0,7430.4
7040,11,0,0,2,29.6,346.45,0,325.6
7041,4,1,0,3,74.4,306.6,1,297.6
7042,66,1,2,0,105.65,6844.5,0,6972.9


In [None]:

df=df.drop(['monthly_charges_tenure'],axis=1)

In [None]:
df.head()

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,0,2,29.85,29.85,0
1,34,1,1,3,56.95,1889.5,0
2,2,1,0,3,53.85,108.15,1
3,45,0,1,0,42.3,1840.75,0
4,2,1,0,2,70.7,151.65,1


# Using pycaret on the churn data to find the best ML model

In [None]:
!pip install pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting pandas<2.2.0 (from pycaret)
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret)
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-2.0.2.tar.gz (165 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Colle

In [None]:
from pycaret.classification import *

In [None]:
setu=setup(df,target='Churn',session_id=42)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


NameError: name 'setup' is not defined

In [None]:
best_model = compare_models(sort="AUC")

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7945,0.8375,0.4962,0.647,0.5604,0.4297,0.4368,0.798
ada,Ada Boost Classifier,0.7976,0.8372,0.5161,0.6502,0.5743,0.4441,0.4498,0.236
lr,Logistic Regression,0.7957,0.8322,0.5153,0.6447,0.5721,0.4403,0.4455,1.411
lightgbm,Light Gradient Boosting Machine,0.7852,0.8265,0.5069,0.6178,0.5555,0.4159,0.4202,1.589
ridge,Ridge Classifier,0.7929,0.8193,0.4588,0.6582,0.54,0.4119,0.4235,0.059
qda,Quadratic Discriminant Analysis,0.7444,0.8193,0.7401,0.5147,0.6062,0.4265,0.4427,0.035
lda,Linear Discriminant Analysis,0.7866,0.8193,0.4932,0.6251,0.5509,0.4135,0.4188,0.063
xgboost,Extreme Gradient Boosting,0.7785,0.8174,0.5039,0.5995,0.5468,0.4018,0.4049,0.13
rf,Random Forest Classifier,0.7769,0.8064,0.4855,0.5996,0.5354,0.391,0.3954,1.35
nb,Naive Bayes,0.7037,0.7992,0.7691,0.4662,0.58,0.3723,0.4004,0.08



In [None]:
setu=setup(df,target='Churn',session_id=42,normalize=True)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


In [None]:
best_model = compare_models(sort="AUC")

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7945,0.8376,0.4985,0.6476,0.5618,0.4309,0.438,0.539
ada,Ada Boost Classifier,0.7994,0.8374,0.5169,0.6559,0.577,0.4482,0.4542,0.603
lr,Logistic Regression,0.7955,0.8322,0.5138,0.6448,0.5711,0.4393,0.4446,0.841
lightgbm,Light Gradient Boosting Machine,0.7866,0.8265,0.5061,0.6221,0.5572,0.4188,0.4232,0.977
qda,Quadratic Discriminant Analysis,0.754,0.823,0.7393,0.5282,0.6154,0.4424,0.4563,0.063
ridge,Ridge Classifier,0.7921,0.8196,0.4557,0.657,0.5374,0.409,0.4208,0.038
lda,Linear Discriminant Analysis,0.7856,0.8196,0.4932,0.6225,0.5499,0.4117,0.4168,0.039
xgboost,Extreme Gradient Boosting,0.7773,0.816,0.513,0.596,0.5504,0.4036,0.4063,0.206
svm,SVM - Linear Kernel,0.7641,0.8084,0.4852,0.5833,0.5033,0.3582,0.373,0.047
nb,Naive Bayes,0.7128,0.8062,0.7539,0.4755,0.5825,0.3808,0.4048,0.039
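
`normalize=True` rescales the numeric features before modelling; to my knowledge pycaret's default method for this is the z-score. A small sketch of that transformation with NumPy, using the five `MonthlyCharges` values from the `df.head()` output earlier (the variable names are assumptions):

```python
import numpy as np

# MonthlyCharges for the first five rows shown in df.head().
monthly_charges = np.array([29.85, 56.95, 53.85, 42.3, 70.7])

# Z-score: subtract the mean, then divide by the standard deviation.
z = (monthly_charges - monthly_charges.mean()) / monthly_charges.std()

print(z.round(3))
# The rescaled values keep the same ordering as the originals.
print(np.array_equal(np.argsort(z), np.argsort(monthly_charges)))
```

Because the ordering of values is preserved, tree-based models such as gradient boosting tend to find equivalent splits, which is consistent with the near-identical scores with and without normalization.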



In [None]:
final_model=finalize_model(best_model)

In [None]:
final_model

In [None]:
setu=setup(df,target='Churn',session_id=42,normalize=True)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


# Comparing the models

In [None]:
best_model = compare_models(sort="AUC")

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7945,0.8376,0.4985,0.6476,0.5618,0.4309,0.438,0.797
ada,Ada Boost Classifier,0.7994,0.8374,0.5169,0.6559,0.577,0.4482,0.4542,0.509
lr,Logistic Regression,0.7955,0.8322,0.5138,0.6448,0.5711,0.4393,0.4446,0.52
lightgbm,Light Gradient Boosting Machine,0.7866,0.8265,0.5061,0.6221,0.5572,0.4188,0.4232,0.846
qda,Quadratic Discriminant Analysis,0.754,0.823,0.7393,0.5282,0.6154,0.4424,0.4563,0.056
ridge,Ridge Classifier,0.7921,0.8196,0.4557,0.657,0.5374,0.409,0.4208,0.066
lda,Linear Discriminant Analysis,0.7856,0.8196,0.4932,0.6225,0.5499,0.4117,0.4168,0.037
xgboost,Extreme Gradient Boosting,0.7773,0.816,0.513,0.596,0.5504,0.4036,0.4063,0.121
svm,SVM - Linear Kernel,0.7641,0.8084,0.4852,0.5833,0.5033,0.3582,0.373,0.083
nb,Naive Bayes,0.7128,0.8062,0.7539,0.4755,0.5825,0.3808,0.4048,0.065



In [None]:
final_model=finalize_model(best_model)

In [None]:
final_model

In [None]:
best_model

In [None]:
df.iloc[-2:-1]

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7041,4,1,0,3,74.4,306.6,1


In [None]:
predict_model(best_model, df.iloc[-2:-1])


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7041,4,1,0,3,74.400002,306.600006,1,1,0.5396


In [None]:
predict_model(best_model, df.iloc[-9:-1])


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.875,1.0,0.5,1.0,0.6667,0.6,0.6547


Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,monthly_charges_tenure,Churn,prediction_label,prediction_score
7034,67,1,0,1,102.949997,6886.25,6897.649902,1,0,0.65
7035,19,1,0,0,78.699997,1495.099976,1495.300049,0,0,0.675
7036,12,0,1,2,60.650002,743.299988,727.799988,0,0,0.841
7037,72,1,2,0,21.15,1419.400024,1522.800049,0,0,0.9875
7038,24,1,1,3,84.800003,1990.5,2035.199951,0,0,0.9146
7039,72,1,1,1,103.199997,7362.899902,7430.399902,0,0,0.9303
7040,11,0,0,2,29.6,346.450012,325.600006,0,0,0.6782
7041,4,1,0,3,74.400002,306.600006,297.600006,1,1,0.5336


# Saving the best model to disk

In [None]:
save_model(best_model, 'final_gradient_boost_classifier')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',...
                                             criterion='friedman_mse', init=None,
                      

In [None]:
import pickle
with open('final_gradient_boost_classifier.pkl', 'wb') as f:
    pickle.dump(best_model, f)

In [None]:
with open('final_gradient_boost_classifier.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
new_data = df.copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([0, 0, 0, ..., 0, 0, 0], dtype=int8)
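
`loaded_model.predict` returns hard 0/1 labels. If the pickled object is a scikit-learn estimator, as it appears to be here, `predict_proba` gives class probabilities instead. A self-contained sketch on synthetic data follows; the data, the feature count, and the variable names are assumptions and not the churn set.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Round-trip through pickle in memory, mirroring the save/load cells above.
restored = pickle.loads(pickle.dumps(model))

# predict_proba returns one column per class, ordered as in restored.classes_;
# for a 0/1 target, column 1 is the probability of class 1 (churn).
churn_probability = restored.predict_proba(X[:5])[:, 1]
print(churn_probability.round(3))
```

Returning column 1 rather than the hard label is what lets a caller apply their own threshold later.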

In [None]:
loaded_gbc = load_model('final_gradient_boost_classifier')

Transformation Pipeline and Model Successfully Loaded


In [None]:
predict_model(loaded_gbc, new_data)

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,monthly_charges_tenure,prediction_label,prediction_score
0,1,0,0,2,29.850000,29.850000,29.850000,1,0.6456
1,34,1,1,3,56.950001,1889.500000,1936.300049,0,0.9398
2,2,1,0,3,53.849998,108.150002,107.699997,0,0.5251
3,45,0,1,0,42.299999,1840.750000,1903.500000,0,0.9214
4,2,1,0,2,70.699997,151.649994,141.399994,1,0.6311
...,...,...,...,...,...,...,...,...,...
7038,24,1,1,3,84.800003,1990.500000,2035.199951,0,0.9146
7039,72,1,1,1,103.199997,7362.899902,7430.399902,0,0.9303
7040,11,0,0,2,29.600000,346.450012,325.600006,0,0.6782
7041,4,1,0,3,74.400002,306.600006,297.600006,1,0.5336


In [None]:
!pip install import_ipynb
import import_ipynb
import os
path="/content/drive/My Drive/Colab Notebooks"
os.chdir(path)

Collecting import_ipynb
  Downloading import_ipynb-0.1.4-py3-none-any.whl.metadata (2.3 kB)
Downloading import_ipynb-0.1.4-py3-none-any.whl (4.1 kB)
Installing collected packages: import_ipynb
Successfully installed import_ipynb-0.1.4


In [None]:
import Predict_churn

importing Jupyter notebook from Predict_churn.ipynb
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks


In [None]:
from Predict_churn import predict_churn
new_data = pd.read_csv('/content/drive/MyDrive/new_churn_data.csv')
probabilities = predict_churn(new_data)
print(probabilities)
true_values = [1, 0, 0, 1, 0]
print("\nTrue Values:", true_values)


Transformation Pipeline and Model Successfully Loaded


   churn_prediction
0                 1
1                 0
2                 0
3                 1
4                 0

True Values: [1, 0, 0, 1, 0]


# Predicting values for the new dataset using the saved model and setting a threshold

In [None]:
%run Predict_churn.ipynb


true_values = [1, 0, 0, 1, 0]
print("\nTrue Values:", true_values)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Transformation Pipeline and Model Successfully Loaded


            churn_prediction
customerID                  
9305-CKSKC                 1
1452-KNGVK                 0
6723-OKKJM                 0
7832-POPKP                 1
6348-TACGU                 0
/content/drive/MyDrive/Colab Notebooks

True Values: [1, 0, 0, 1, 0]


# Summary

Write a short summary of the process and results here.

I used pycaret, an AutoML library that trains and compares many machine learning models with very little code, which makes it quick to find a strong candidate. After preprocessing the data, I ran `setup` with `session_id=42` so that the results are reproducible. I compared the models by AUC (area under the ROC curve) because this is a binary classification problem and the classes are imbalanced, so accuracy alone can be misleading. I also added normalization to `setup` to check whether it changed the results, but the best model stayed the same: the Gradient Boosting Classifier, with an accuracy of 0.7945 and an AUC of 0.8376.

I inspected the model pipeline and saved it to disk as a pickle file, then loaded it back and predicted probability scores for the data. I also wrote a function in an ipynb file (`Predict_churn.ipynb`) that takes a dataframe and a threshold as input and returns the churn prediction based on that threshold. With the threshold set to 0.7, the predictions match the true values for the new dataset.
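
The thresholding step described above can be illustrated with plain pandas. This is a sketch, not the code in `Predict_churn.ipynb`: the function name `label_from_threshold` and the example rows are hypothetical, and it assumes `prediction_score` is the probability of the predicted label, as pycaret's output columns suggest.

```python
import pandas as pd


def label_from_threshold(predictions, threshold=0.5):
    """Return 0/1 churn labels from pycaret-style prediction columns.

    Assumes `prediction_score` is the probability of `prediction_label`,
    so the churn probability is the score when the label is 1 and
    1 - score when the label is 0.
    """
    is_churn = predictions["prediction_label"] == 1
    score = predictions["prediction_score"]
    churn_prob = score.where(is_churn, 1 - score)
    return (churn_prob >= threshold).astype(int).rename("churn_prediction")


# Hypothetical predictions for three customers.
example = pd.DataFrame(
    {"prediction_label": [1, 0, 1], "prediction_score": [0.90, 0.80, 0.60]}
)
print(label_from_threshold(example, threshold=0.7).tolist())  # [1, 0, 0]
```

Raising the threshold trades recall for precision: the third customer is flagged at 0.5 but not at 0.7.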

In [None]:
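# `!cd` runs in a subshell, so it does not change the notebook's working
# directory; the os.chdir call in the next cell does.
!cd /content/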

In [None]:
path="/content/"
os.chdir(path)

In [None]:
!pwd

/content/drive/MyDrive/Colab Notebooks


In [None]:
!pwd

/content


In [None]:
def predict_churn(dataframe, training_data):
    # training_data is accepted but not used yet: the percentiles below are
    # ranks within `dataframe` itself, not within the training distribution.
    predictions = predict_model(loaded_model, data=dataframe)
    # prediction_score is the probability of the predicted label
    scores = predictions['prediction_score']
    # define percentiles locally so the return statement does not rely on a global
    percentiles = scores.rank(pct=True) * 100
    predictions['percentiles'] = percentiles
    return predictions['prediction_label'], scores, percentiles, predictions

new_data = pd.read_csv('/content/drive/MyDrive/new_churn_data.csv')
training_data = df.copy()
# pd.read_csv('prepared_churn_data.csv')

# Make predictions
labels, scores, percentiles, predction = predict_churn(new_data, training_data)
print("Predicted Labels:", labels)
print("Prediction Scores:", scores)
print("Prediction Percentiles:", percentiles)
print("predicted dataframe")
predction

Predicted Labels: 0    1
1    0
2    0
3    0
4    0
Name: prediction_label, dtype: int64
Prediction Scores: 0    0.6757
1    0.9038
2    0.8204
3    0.6324
4    0.7490
Name: prediction_score, dtype: float64
Prediction Percentiles: 0     40.0
1    100.0
2     80.0
3     20.0
4     60.0
Name: prediction_score, dtype: float64
predicted dataframe


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score,percentiles
0,9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,1,0.6757,40.0
1,1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,0,0.9038,100.0
2,6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.8204,80.0
3,7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6324,20.0
4,6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,0,0.749,60.0
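
The percentiles above rank the five new rows against each other. The optional challenge asks where each prediction falls in the distribution of training-set predictions instead. One way to do that is sketched below with NumPy; the function name `percentile_in_training` and all probability values are hypothetical.

```python
import numpy as np


def percentile_in_training(new_probs, train_probs):
    """Percent of training probabilities that are <= each new probability."""
    train_sorted = np.sort(np.asarray(train_probs, dtype=float))
    new_values = np.asarray(new_probs, dtype=float)
    # searchsorted(side="right") counts the training values <= each new value.
    counts = np.searchsorted(train_sorted, new_values, side="right")
    return 100.0 * counts / len(train_sorted)


# Hypothetical churn probabilities; a real run would use the model's
# predictions on the training data and on the new rows.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
new_probs = [0.78, 0.10, 0.99]
print(percentile_in_training(new_probs, train_probs).tolist())  # [80.0, 20.0, 100.0]
```

With this approach a single new row still gets a meaningful percentile, which a within-batch rank cannot provide.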


In [None]:
!pwd

/content/drive/MyDrive/Colab Notebooks


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tenure          7043 non-null   int64  
 1   PhoneService    7043 non-null   int64  
 2   Contract        7043 non-null   int64  
 3   PaymentMethod   7043 non-null   int64  
 4   MonthlyCharges  7043 non-null   float64
 5   TotalCharges    7043 non-null   float64
 6   Churn           7043 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 385.3 KB


In [None]:
!pip install tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

Collecting tpot
  Downloading TPOT-0.12.2-py3-none-any.whl.metadata (2.0 kB)
Collecting deap>=1.2 (from tpot)
  Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Collecting stopit>=1.1.1 (from tpot)
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Downloading TPOT-0.12.2-py3-none-any.whl (87 kB)
Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Building wheel

In [None]:
x=df.drop(['Churn'],axis=1)
y=df["Churn"]
# train_test_split returns the train split first, then the test split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

In [None]:
tpot = TPOTClassifier(verbosity=2, generations=5, population_size=20)
tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8006776332732424

Generation 2 - Current best internal CV score: 0.8006776332732424

Generation 3 - Current best internal CV score: 0.8012442055112027

Generation 4 - Current best internal CV score: 0.8012474246716457

Generation 5 - Current best internal CV score: 0.8023821787277878

Best pipeline: LinearSVC(input_matrix, C=10.0, dual=False, loss=squared_hinge, penalty=l1, tol=1e-05)
0.7854979174555092


In [None]:
tpot.score(x_train,y_train)

0.8040885860306644

In [None]:
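class ChurnPredictor:
    """Wraps a loaded model so predictions can be made with one call."""

    def __init__(self, model):
        # `model` is an already-loaded model object, not a file path
        self.model = model

    def predict(self, dataframe):
        predictions = predict_model(self.model, data=dataframe)
        return predictions['prediction_label'], predictions['prediction_score'], predictions

# Usage
predictor = ChurnPredictor(loaded_model)
# `predction` already holds prediction columns from the earlier cell, so the
# output below shows them twice; pass new_data for a clean result.
labels, scores, data_frame = predictor.predict(predction)
print("Predicted Labels:", labels)
print("Prediction Scores:", scores)
data_frame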

Predicted Labels:    prediction_label  prediction_label
0                 1                 1
1                 0                 0
2                 0                 0
3                 0                 0
4                 0                 0
Prediction Scores:    prediction_score  prediction_score
0            0.6757            0.6757
1            0.9038            0.9038
2            0.8204            0.8204
3            0.6324            0.6324
4            0.7490            0.7490


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score,percentiles,prediction_label.1,prediction_score.1
0,9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,1,0.6757,40.0,1,0.6757
1,1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,0,0.9038,100.0,0,0.9038
2,6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.8204,80.0,0,0.8204
3,7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6324,20.0,0,0.6324
4,6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,0,0.749,60.0,0,0.749
