# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
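
One way to approach the percentile challenge, as a minimal sketch: sort the churn probabilities predicted for the training set once, then locate each new probability in that sorted list. The training probabilities below are made-up placeholder values and `percentile_of` is a hypothetical helper name; in practice the list would come from the model's predictions on the training data.

```python
from bisect import bisect_right


def percentile_of(value, sorted_reference):
    """Return the percentage of reference values that are <= value."""
    if not sorted_reference:
        raise ValueError("reference distribution is empty")
    rank = bisect_right(sorted_reference, value)
    return 100.0 * rank / len(sorted_reference)


# Placeholder training-set churn probabilities (illustrative only).
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.55, 0.62, 0.78, 0.91])

for p in (0.08, 0.78):
    print(f"probability {p:.2f} -> percentile {percentile_of(p, train_probs):.0f}")
```

Sorting once and using a binary search keeps each lookup cheap even when the training set has thousands of rows.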

In [5]:
pip install pandas

Collecting pandas
  Using cached pandas-1.3.5-cp37-cp37m-win_amd64.whl (10.0 MB)
Collecting pytz>=2017.3
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Collecting numpy>=1.17.3
  Using cached numpy-1.21.5-cp37-cp37m-win_amd64.whl (14.0 MB)
Installing collected packages: pytz, numpy, pandas
  Attempting uninstall: numpy
    Found existing installation: numpy 1.16.6
    Uninstalling numpy-1.16.6:
      Successfully uninstalled numpy-1.16.6
Successfully installed numpy-1.21.5 pandas-1.3.5 pytz-2021.3
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd

df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1.0,0,0,0,29.85,29.85,0,14.925000
5575-GNVDE,34.0,1,1,1,56.95,1889.50,0,53.985714
3668-QPYBK,2.0,1,0,1,53.85,108.15,1,36.050000
7795-CFOCW,45.0,0,1,2,42.30,1840.75,0,40.016304
9237-HQITU,2.0,1,0,0,70.70,151.65,1,50.550000
...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,1,1,1,84.80,1990.50,0,79.620000
2234-XADUH,72.0,1,1,3,103.20,7362.90,0,100.861644
4801-JZAZL,11.0,0,0,0,29.60,346.45,0,28.870833
8361-LTMKD,4.0,1,0,1,74.40,306.60,1,61.320000


In [1]:
conda install -c conda-forge pycaret -y

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [1]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
automl = setup(data = df, target = 'Churn', fold_shuffle=True, preprocess=False)

,Description,Value
0,session_id,3614
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Transformed Train Set,"(4930, 7)"
9,Transformed Test Set,"(2113, 7)"


In [5]:
automl[6]

{'USI',
 'X',
 'X_test',
 'X_train',
 '_all_metrics',
 '_all_models',
 '_all_models_internal',
 '_available_plots',
 '_gpu_n_jobs_param',
 '_internal_pipeline',
 '_ml_usecase',
 'create_model_container',
 'data_before_preprocess',
 'display_container',
 'exp_name_log',
 'experiment__',
 'fix_imbalance_method_param',
 'fix_imbalance_param',
 'fold_generator',
 'fold_groups_param',
 'fold_param',
 'fold_shuffle_param',
 'gpu_param',
 'html_param',
 'imputation_classifier',
 'imputation_regressor',
 'iterative_imputation_iters_param',
 'log_plots_param',
 'logging_param',
 'master_model_container',
 'n_jobs_param',
 'prep_pipe',
 'pycaret_globals',
 'seed',
 'stratify_param',
 'target_param',
 'transform_target_method_param',
 'transform_target_param',
 'y',
 'y_test',
 'y_train'}

In [6]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7957,0.8403,0.4931,0.6496,0.5597,0.4302,0.4376,0.802
lda,Linear Discriminant Analysis,0.7945,0.8324,0.4523,0.6628,0.5371,0.4112,0.4239,0.016
gbc,Gradient Boosting Classifier,0.7931,0.8385,0.4892,0.6426,0.5548,0.4234,0.4305,0.249
ridge,Ridge Classifier,0.7929,0.0,0.4032,0.6824,0.5061,0.3862,0.4079,0.016
catboost,CatBoost Classifier,0.7919,0.8345,0.4954,0.6366,0.5567,0.4236,0.4296,1.692
ada,Ada Boost Classifier,0.7895,0.8356,0.4938,0.629,0.5526,0.4178,0.4233,0.252
rf,Random Forest Classifier,0.7769,0.8034,0.4792,0.5986,0.5313,0.3874,0.3921,0.306
svm,SVM - Linear Kernel,0.7635,0.0,0.3973,0.599,0.4346,0.3086,0.3294,0.03
knn,K Neighbors Classifier,0.7615,0.7503,0.4431,0.5614,0.4944,0.3414,0.346,0.032
et,Extra Trees Classifier,0.7615,0.7846,0.47,0.5596,0.5097,0.3539,0.357,0.26
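
The metric columns in the comparison table are all derived from counts of true/false positives and negatives. The sketch below uses made-up counts (not numbers from this dataset) to show how accuracy, recall, precision, and F1 are computed; `classification_metrics` is a hypothetical helper. It illustrates how a model can reach fairly high accuracy while still missing about half of the churners, which is the pattern in the table above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up counts: 100 customers, 25 of whom actually churned.
metrics = classification_metrics(tp=12, fp=6, fn=13, tn=69)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If catching churners matters more than overall correctness, recall or AUC may be a better ranking metric than accuracy; `compare_models` accepts a `sort` argument for this (for example `sort='AUC'`).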


In [7]:
best_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=3614, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [8]:
df.iloc[-2:-1].shape

(1, 8)

In [9]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure,Label,Score
8361-LTMKD,4.0,1,0,1,74.4,306.6,1,61.32,1,0.5459
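
`df.iloc[-2:-1]` uses a slice, so the result stays a 2D DataFrame, which is what `predict_model` expects. Note that this slice picks the second-to-last row; `df.iloc[-1:]` would pick the last one. A small sketch on a made-up three-row frame (the values and index labels are illustrative, not the churn data):

```python
import pandas as pd

# Tiny stand-in frame (illustrative values only).
demo = pd.DataFrame(
    {"tenure": [1.0, 34.0, 2.0], "MonthlyCharges": [29.85, 56.95, 53.85]},
    index=pd.Index(["A", "B", "C"], name="customerID"),
)

second_to_last = demo.iloc[-2:-1]  # slice: one-row DataFrame (row "B")
last_row = demo.iloc[-1:]          # slice: one-row DataFrame (row "C")
as_series = demo.iloc[-1]          # single integer: 1D Series, row axis is lost

print(second_to_last.shape)
print(last_row.shape)
print(type(as_series).__name__, as_series.shape)
```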


In [10]:
save_model(best_model, 'lr')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ['trained_model',
                  LogisticRegression(C=1.0, class_weight=None, dual=False,
                                     fit_intercept=True, intercept_scaling=1,
                                     l1_ratio=None, max_iter=1000,
                                     multi_class='auto', n_jobs=None,
                                     penalty='l2', random_state=3614,
                                     solver='lbfgs', tol=0.0001, verbose=0,
                                     warm_start=False)]],
          verbose=

In [11]:
import pickle

with open('lr_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [12]:
with open('lr_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [13]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1])

In [14]:
loaded_lr = load_model('lr')

Transformation Pipeline and Model Successfully Loaded


In [15]:
predict_model(loaded_lr, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Label,Score
8361-LTMKD,4.0,1,0,1,74.4,306.6,61.32,1,0.5459
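
The assignment asks for the probability of churn, while the output above has a predicted `Label` and a `Score`. Assuming `Score` is the probability of the predicted label (this appears to be the behavior in recent PyCaret 2.x releases, but it is worth verifying for the installed version), the churn probability could be derived as sketched below. `churn_probability` is a hypothetical helper, and the second row is made up for illustration.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert Label/Score columns into P(churn) for each row.

    Assumes Score is the probability of the predicted Label, so rows
    predicted as 0 (no churn) need 1 - Score.
    """
    is_churn = predictions["Label"].astype(int) == 1
    prob = predictions["Score"].where(is_churn, 1 - predictions["Score"])
    return prob.rename("churn_probability")


# First row mirrors the output above; the second row is made up.
preds = pd.DataFrame(
    {"Label": [1, 0], "Score": [0.5459, 0.9000]},
    index=pd.Index(["8361-LTMKD", "hypothetical-id"], name="customerID"),
)
print(churn_probability(preds))
```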


In [16]:
from IPython.display import Code

Code('predict_Churn.py')

In [17]:
%run predict_Churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP       churn
6348-TACGU    No churn
Name: churn_prediction, dtype: object
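
The printed labels can be checked against the true values given in the assignment, [1, 0, 0, 1, 0]. A minimal sketch, with the labels copied by hand from the output above (variable names are illustrative):

```python
# Labels printed by predict_Churn.py above, in order.
predicted_labels = ["churn", "No churn", "No churn", "churn", "No churn"]
# True values given in the assignment for new_churn_data.csv.
true_values = [1, 0, 0, 1, 0]

# Map the text labels to 1 (churn) and 0 (no churn).
predicted_values = [1 if label == "churn" else 0 for label in predicted_labels]
matches = [p == t for p, t in zip(predicted_values, true_values)]
accuracy = sum(matches) / len(true_values)

print(predicted_values)
print(f"accuracy on new data: {accuracy:.0%}")
```

All five predictions match, though five rows is too small a sample to say much about general performance.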


Summary:

In week 5 I worked on automation techniques. I imported pandas as pd and loaded the prepared data from week 2, prepped_churn_data.csv. I installed pycaret for AutoML, imported setup, compare_models, predict_model, save_model, and load_model from pycaret.classification, ran setup, and checked that the data types inferred for the input data were correct.

I then ran compare_models to find the best model. Ranked by accuracy, my best model was logistic regression (lr), closely followed by several others; the winner changes from run to run because the accuracy scores are so similar between models. To test a prediction I selected the second-to-last row with df.iloc[-2:-1], which keeps the result 2D (a DataFrame rather than a Series). predict_model returned a label of 1 with a score of 0.5459.

I saved the best model two ways: with pickle (dump and load) and with pycaret's save_model and load_model. I predicted on the same row with each loaded model to confirm that it worked. I wrote predict_Churn.py in Visual Studio Code using load_model and predict_model, and displayed it in the notebook with Code from IPython.display. Running the script on the new data gave churn, No churn, No churn, churn, No churn, which matches the true values (1, 0, 0, 1, 0), so the model is working properly.