# Modeling

- Key points concerning the dataset
1. Dataset is imbalanced
2. Few outliers are present
3. Multicollinearity observed between certain variables

- We can use tree based algortihms such as Random Forest or XGBoost since they are robust at handling outliers, multicollinearity, feature selection and doesn't necessarily require feature scaling. However, we have already performed feature selection through EDA.
- With trees we don't have to encode any categorical variables.

In [12]:
import pandas as pd
import sklearn
import xgboost

In [13]:
df = pd.read_csv('../data/processed/Data_Science_Challenge.csv')

In [14]:
X = df.drop(['churn'], axis = 1)
y = df[['churn']]

In [15]:
X = df[['international plan', 'voice mail plan', 'number vmail messages',
       'total day minutes', 'total day charge',
       'total eve minutes', 'total eve charge',
       'total night minutes', 'total night charge',
       'total intl minutes', 'total intl charge',
       'customer service calls']]

In [16]:
print(X)
print(y)



     international plan voice mail plan  number vmail messages  \
0                    no             yes                     25   
1                    no             yes                     26   
2                    no              no                      0   
3                   yes              no                      0   
4                   yes              no                      0   
...                 ...             ...                    ...   
3328                 no             yes                     36   
3329                 no              no                      0   
3330                 no              no                      0   
3331                yes              no                      0   
3332                 no             yes                     25   

      total day minutes  total day charge  total eve minutes  \
0                 265.1             45.07              197.4   
1                 161.6             27.47              195.5   
2              

In [17]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   international plan      3333 non-null   object 
 1   voice mail plan         3333 non-null   object 
 2   number vmail messages   3333 non-null   int64  
 3   total day minutes       3333 non-null   float64
 4   total day charge        3333 non-null   float64
 5   total eve minutes       3333 non-null   float64
 6   total eve charge        3333 non-null   float64
 7   total night minutes     3333 non-null   float64
 8   total night charge      3333 non-null   float64
 9   total intl minutes      3333 non-null   float64
 10  total intl charge       3333 non-null   float64
 11  customer service calls  3333 non-null   int64  
dtypes: float64(8), int64(2), object(2)
memory usage: 312.6+ KB


In [18]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   churn   3333 non-null   bool 
dtypes: bool(1)
memory usage: 3.4 KB


In [19]:
cols = ['international plan', 'voice mail plan']
for col in cols:
    X[col] = X[col].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[col] = X[col].astype('category')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X[col] = X[col].astype('category')


In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99, stratify=y)


In [21]:
from xgboost import XGBClassifier
classifier = XGBClassifier(
    scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1]),
    random_state = 99,
    eval_metric = 'logloss',
    enable_categorical = True
)

In [22]:

classifier.fit(X_train, y_train)


In [23]:
y_pred = classifier.predict(X_test)

# Predict probabilities
y_pred_prob = classifier.predict_proba(X_test)[:, 1]


In [25]:
print(y_pred_prob)

[9.93352652e-01 1.11938838e-03 6.84485678e-03 9.92439985e-01
 3.98332719e-03 4.93897009e-04 1.03630882e-03 3.07270527e-01
 9.69980061e-01 8.11768696e-03 4.93283500e-04 9.95776236e-01
 9.98657942e-01 2.33292417e-03 4.38960362e-03 3.03598284e-03
 4.21049744e-02 3.62873106e-04 8.45715225e-01 2.24705134e-02
 8.41421913e-03 3.59271816e-03 1.02060393e-03 4.21738077e-04
 8.44084099e-03 3.69219662e-04 3.90359759e-03 5.43883443e-03
 9.90374446e-01 1.83623401e-03 2.77882461e-02 5.90303652e-02
 4.19454947e-02 7.92575069e-03 1.56094329e-02 9.49681640e-01
 1.41493324e-03 4.10567690e-03 9.90255535e-01 7.41532743e-02
 6.08091876e-02 1.88230153e-03 5.04603551e-04 1.60299765e-03
 2.32295394e-02 1.05052697e-03 1.40122278e-03 2.54634619e-02
 9.86981571e-01 4.55081183e-03 4.54933033e-04 9.32141006e-01
 2.81717442e-03 1.21434883e-03 5.37176791e-04 2.77669467e-02
 1.32257142e-03 1.56127941e-02 1.15141664e-02 4.37152712e-03
 7.54970266e-03 1.51261821e-01 2.81606652e-02 4.14775219e-03
 3.71318753e-03 1.422969