# Library and the DataFrame

In [1]:
import pandas as pd

In [2]:
uri = '/content/drive/MyDrive/Churn/churn.csv'

In [3]:
df = pd.read_csv(uri)

# Handling the DataFrame



In [4]:
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [5]:
df.drop(columns=['Surname','RowNumber','CustomerId'],inplace=True)

In [6]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [7]:
df.Geography.unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [8]:
df.Gender.unique()

array(['Female', 'Male'], dtype=object)

In [9]:
df = pd.concat([df,pd.get_dummies(df['Geography'],prefix='Country')],axis=1)

In [10]:
df.Gender = df['Gender'].map({'Female':1, 'Male':0})

In [11]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Country_France,Country_Germany,Country_Spain
0,619,France,1,42,2,0.0,1,1,1,101348.88,1,1,0,0
1,608,Spain,1,41,1,83807.86,1,0,1,112542.58,0,0,0,1
2,502,France,1,42,8,159660.8,3,1,0,113931.57,1,1,0,0
3,699,France,1,39,1,0.0,2,0,0,93826.63,0,1,0,0
4,850,Spain,1,43,2,125510.82,1,1,1,79084.1,0,0,0,1


In [12]:
df.drop(columns=['Geography'],inplace=True)

In [13]:
df.head()

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Country_France,Country_Germany,Country_Spain
0,619,1,42,2,0.0,1,1,1,101348.88,1,1,0,0
1,608,1,41,1,83807.86,1,0,1,112542.58,0,0,0,1
2,502,1,42,8,159660.8,3,1,0,113931.57,1,1,0,0
3,699,1,39,1,0.0,2,0,0,93826.63,0,1,0,0
4,850,1,43,2,125510.82,1,1,1,79084.1,0,0,0,1


# Explaining the DataFrame

> ### CreditScore
> ***

Credit score of each customer.

> ### Gender
> ***
|Code|Description|
|---|---|
|0|Male|
|1|Female|

> ### Age
> ***

Age of each customer.

> ### Tenure
> ***

Indicates how long a customer has maintained an account with the bank.

> ### Country_France | Country_Germany | Country_Spain
> ***
Each of these three columns indicates whether the customer is from the named country.
> ***

|Code|Description|
|---|---|
|1|Yes|
|0|No|
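
As a small illustration of how these indicator columns come about, the sketch below applies `pd.get_dummies` to a toy frame. The rows are hypothetical; only the `Geography` column name and the `Country` prefix mirror the notebook. Passing `dtype=int` is an assumption made here so the columns show 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column name mirrors the dataset.
toy = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany', 'France']})

# One indicator column per country; dtype=int keeps them as 0/1 integers.
dummies = pd.get_dummies(toy['Geography'], prefix='Country', dtype=int)
print(dummies)

# Every row has exactly one indicator set to 1.
print(dummies.sum(axis=1).tolist())
```

Because exactly one of the three columns is 1 in every row, any one of them could be dropped without losing information; the notebook keeps all three.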

> ### Exited
> ***
Indicates whether the customer has left the bank (churned).
> ***
|Code|Description|
|---|---|
|1|Yes|
|0|No|

> ### EstimatedSalary
> ***
Estimate of the customer's salary.

> ### Balance
> ***

Represents the customer's account balance.

> ### HasCrCard
> ***

Indicates whether the customer has a credit card or not.
> ***
|Code|Description|
|---|---|
|1|Yes|
|0|No|

> ### NumOfProducts
> ***

Indicates the number of products that the customer owns.

> ### IsActiveMember
> ***

Indicates whether the customer is an active member or not.
> ***
|Code|Description|
|---|---|
|1|Yes|
|0|No|

# Machine Learning Algorithm

Splitting the data

In [14]:
from sklearn.model_selection import train_test_split
from numpy import random

In [15]:
x = df.drop(columns=['Exited'])
y = df['Exited']  # 1-D Series, so scikit-learn does not warn about a column-vector y

In [16]:
SEED = 1234
random.seed(SEED)

x_train,x_test,y_train,y_test = train_test_split(x,y,
                                                 test_size=0.25,
                                                 stratify = y)

In [17]:
print('Training with %d elements and testing with %d elements'%(len(x_train),len(x_test)))

Training with 7500 elements and testing with 2500 elements
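
To see what `stratify` contributes to the split, here is a minimal sketch on synthetic labels, not the churn data. The arrays `X_demo` and `y_demo` and the 20% positive share are assumptions chosen for illustration, and `random_state` is passed directly to `train_test_split` as an alternative way of fixing the seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (20% positives); not the churn data.
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1234
)

print(len(X_tr), len(X_te))      # 750 250
print(y_tr.mean(), y_te.mean())  # the positive share stays close to 20% in both parts
```

Without stratification the share of positives in the test set could drift away from that of the full data, which makes metrics on a minority class less stable.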


## LinearSVC

In [18]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

Training the model

In [19]:
model = LinearSVC()
model.fit(x_train,y_train)



Predicting

In [20]:
p = model.predict(x_test)

Accuracy

In [21]:
print('The accuracy for this model is -> %.2f%%'%(accuracy_score(y_test,p)*100))

The accuracy for this model is -> 70.76%


**NORMALIZING DATA TO IMPROVE PERFORMANCE**




In [22]:
from sklearn.preprocessing import StandardScaler

In [23]:
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [24]:
model = LinearSVC()
model.fit(x_train_scaled,y_train)



In [25]:
p_n = model.predict(x_test_scaled)

In [47]:
print('The accuracy for the NORMALIZED MODEL is -> %.2f%%'%(accuracy_score(y_test,p_n)*100))

The accuracy for the NORMALIZED MODEL is -> 80.68%
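
The jump in accuracy is plausibly due to features such as `Age` and `Balance` living on very different scales. The sketch below shows what `StandardScaler` does on a tiny hypothetical array (the numbers are made up): it learns the mean and standard deviation from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales (think Age vs. Balance).
train = np.array([[30.0, 0.0], [40.0, 50000.0], [50.0, 100000.0]])
test = np.array([[40.0, 50000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled)                # this test row equals the training mean, so it maps to zeros
```

Fitting the scaler on the training data alone, as the notebook does, keeps information from the test set out of the preprocessing step.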


**Tuning Hyperparameters**

In [27]:
from sklearn.model_selection import GridSearchCV

In [28]:
param_grid = {'C':[0.001,0.01,0.1,1,10,1000]}

In [29]:
grid_search = GridSearchCV(LinearSVC(),param_grid,cv=5)

In [None]:
grid_search.fit(x_train_scaled,y_train)

In [31]:
best_model = grid_search.best_estimator_

In [32]:
p_b = best_model.predict(x_test_scaled)

In [48]:
print('The accuracy for the best model is -> %.2f%%' % (accuracy_score(y_test, p_b) * 100))


The accuracy for the best model is -> 81.00%
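
The notebook does not show which `C` the search selected. A possible way to inspect that is sketched below on synthetic data from `make_classification`, not the churn CSV. Wrapping the scaler and the classifier in a pipeline, the shortened grid, and `max_iter=5000` are all assumptions of this sketch rather than choices made in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data; the churn CSV is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Scaling inside the pipeline is refit on the training part of every CV fold.
pipe = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
search = GridSearchCV(pipe, {'linearsvc__C': [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the C value with the best mean CV accuracy
print(search.best_score_)   # that mean cross-validated accuracy
```

`best_score_` is an average over the validation folds, so it can differ from the accuracy measured on the held-out test set.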


Our latest model improves on the earlier ones: after standardizing the data and tuning the hyperparameters, accuracy reaches 81.00%.
***
In other words, in 81.00% of the test cases we correctly predicted whether the customer would stay or leave.

In addition to accuracy, we will use the F1-score, the harmonic mean of precision and recall, which is high only when the model keeps both false positives and false negatives low.
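
As a worked example of that metric, the sketch below computes precision, recall and F1 from raw counts in plain Python. The helper name `f1_from_counts` and the counts themselves are hypothetical and are not taken from the notebook's predictions.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts: the model flags few churners, but is usually right when it does.
precision, recall, f1 = f1_from_counts(tp=20, fp=5, fn=80)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
# precision=0.80 recall=0.20 f1=0.32
```

A model with high precision but low recall still ends up with a low F1, because the harmonic mean is pulled toward the smaller of the two values.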

In [34]:
from sklearn.metrics import f1_score

In [54]:
print('The result for the F1-score in this model is -> %.2f%%' % (f1_score(y_test, p_b) * 100))

The result for the F1-score in this model is -> 28.14%



***
**Because the LinearSVC model does not handle class imbalance well, it did not reach an acceptable F1-score. We will therefore test other models, Decision Tree and Random Forest, which cope better with such cases.**
***
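
To make the gap between accuracy and F1 under class imbalance concrete, the sketch below scores a trivial predictor that always answers "stays". The labels are synthetic, with an assumed 20% churn rate, and are not the notebook's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1] * 20 + [0] * 80)  # synthetic labels, 20% churners
y_pred = np.zeros_like(y_true)          # always predict "stays"

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f'accuracy={acc:.2f} f1={f1:.2f}')
# accuracy=0.80 f1=0.00
```

A model can therefore look reasonable on accuracy while finding none of the churners, which is why the F1-score on the churn class is tracked alongside accuracy.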

## Random Forest

In [36]:
from sklearn.ensemble import RandomForestClassifier

Instantiating and training the model.








In [37]:
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(x_train,y_train)



Accuracy

In [49]:
print('The accuracy for Random Forest is -> %.2f%%'%(classifier.score(x_test,y_test)*100))

The accuracy for Random Forest is -> 86.32%


The model already shows an improvement: Random Forest predicts churn correctly in 86.32% of the test cases.

F1-Score

In [39]:
from sklearn.metrics import f1_score

Predicting

In [40]:
p_r = classifier.predict(x_test)

In [50]:
print('The result for the F1-score in this model is -> %.2f%%' % (f1_score(y_test,p_r ) * 100))

The result for the F1-score in this model is -> 58.80%


An F1-score of 58.80% reflects a much better balance between precision and recall than LinearSVC achieved (28.14%), which is particularly important when dealing with class imbalance.

## Decision Tree


In [42]:
from sklearn.tree import DecisionTreeClassifier

Instantiating and training the model.







In [43]:
dtc = DecisionTreeClassifier(criterion='entropy',
                             random_state = 42)
dtc.fit(x_train,y_train)

Predicting

In [44]:
p_tree = dtc.predict(x_test)

Accuracy

In [52]:
print('The accuracy for the Decision Tree  is -> %.2f%%' % (accuracy_score(y_test, p_tree) * 100))


The accuracy for the Decision Tree  is -> 78.00%


F1-Score

In [55]:
print('The result for the F1-score in this model is-> %.2f%%' % (f1_score(y_test,p_tree ) * 100))

The result for the F1-score in this model is-> 46.08%


**The Decision Tree model improved on LinearSVC's F1-score but scored lower than Random Forest on both metrics.**

## Metrics by model
***
**LinearSVC:**
*   Accuracy -> 81.00%
*   F1-Score -> 28.14%
***
**Random Forest:**
*   Accuracy -> 86.32%
*   F1-Score -> 58.80%
***
**Decision Tree:**
*   Accuracy -> 78.00%
*   F1-Score -> 46.08%
***
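
A summary like the one above can also be built programmatically. The sketch below is one possible way to do it, run on synthetic imbalanced data from `make_classification` rather than the churn CSV, so its numbers will not match the figures listed. The names `models` and `summary`, the data sizes, and the `random_state` values are assumptions of this sketch.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 80% class 0); not the churn CSV.
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

models = {
    'LinearSVC': make_pipeline(StandardScaler(), LinearSVC(max_iter=5000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
}

# Fit each model on the same split and collect both metrics in one table.
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({'Model': name,
                 'Accuracy': accuracy_score(y_te, pred),
                 'F1-Score': f1_score(y_te, pred, zero_division=0)})

summary = pd.DataFrame(rows).set_index('Model')
print(summary.round(4))
```

Evaluating every model on the same split keeps the comparison fair, since differences then come from the models rather than from the data each one happened to see.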

**We conclude that the best model for this case is Random Forest: it is correct in 86.32% of the test cases, and its F1-score of 58.80% shows a clearly better balance between precision and recall than the other models.**