<a href="https://colab.research.google.com/github/Ninadrmore1999/ML-projects-/blob/main/Telecom_customer_churn_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's load the data of our business case now

In [None]:
#Churn prediction in telecom.
import numpy as np
import matplotlib.pyplot as plt

In [None]:

!gdown 1uUt7uL-VuF_5cpodYRiriEwhsldeEp3m

In [None]:
import pandas as pd
churn = pd.read_csv("churn_logistic.csv")
churn.head()

In [None]:
cols = ['Day Mins', 'Eve Mins', 'Night Mins', 'CustServ Calls', 'Account Length']
y = churn["Churn"]
X = churn[cols]
X.shape

Let's split the data into training, validation and test sets




In [None]:
from sklearn.model_selection import train_test_split

X_tr_cv, X_test, y_tr_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tr_cv, y_tr_cv, test_size=0.25,random_state=1)
X_train.shape

We will scale our data before fitting the model

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
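To make the scaling step concrete, here is a minimal plain-Python sketch of the same idea: the mean and standard deviation are learned from the training split only and then reused on the validation split. The numbers are made up for illustration and are not taken from the churn data.

```python
from statistics import fmean, pstdev

# Hypothetical values for one feature column (not taken from the churn data).
train_col = [100.0, 150.0, 200.0, 250.0]
val_col = [175.0, 300.0]

# "fit": learn the statistics from the training split only.
mu = fmean(train_col)
sigma = pstdev(train_col)  # population std (ddof=0), which StandardScaler also uses

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train_col, mu, sigma)
val_scaled = standardize(val_col, mu, sigma)  # reuses the training statistics

print(train_scaled)
print(val_scaled)
```

Reusing the training statistics on the validation and test splits keeps information from those splits out of the preprocessing step.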

In [None]:
X_train

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
model.predict(X_train)

## Accuracy Metric

Let's implement our accuracy metric now

In [None]:
def accuracy(y_true, y_pred):
  return np.sum(y_true==y_pred)/y_true.shape[0]

In [None]:
accuracy(y_train, model.predict(X_train))

In [None]:
accuracy(y_val, model.predict(X_val))

So our model has a validation accuracy of about 0.71 (71.49%)

## **Hyperparameter tuning**


Let's tune the hyperparameter $C = \frac{1}{\lambda}$, the inverse of the regularization strength $\lambda$, to see whether it improves the model's performance
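For reference, the L2-regularized objective behind `LogisticRegression` can be written as below, assuming labels encoded as $y_i \in \{-1, +1\}$ and ignoring any sample-weight normalization. Dividing the first form by $C$ gives the second, which shows that $\lambda = 1/C$ acts as the penalty strength: a small $C$ means strong regularization.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (w^\top x_i + b)}\right)
\;\;\Longleftrightarrow\;\;
\min_{w,\,b} \; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (w^\top x_i + b)}\right) + \frac{\lambda}{2}\lVert w \rVert_2^2,
\qquad \lambda = \frac{1}{C}
```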

In [None]:
from sklearn.pipeline import make_pipeline
train_scores = []
val_scores = []
scaler = StandardScaler()
# X_train and X_val were already standardized above, so the scaler inside the
# pipeline is effectively a no-op here; it matters only if raw data is passed in.
for la in np.arange(0.01, 5000.0, 100): # range of values of lambda
  # C is the inverse of the regularization strength lambda
  scaled_lr = make_pipeline(scaler, LogisticRegression(C=1/la))
  scaled_lr.fit(X_train, y_train)
  train_score = accuracy(y_train, scaled_lr.predict(X_train))
  val_score = accuracy(y_val, scaled_lr.predict(X_val))
  train_scores.append(train_score)
  val_scores.append(val_score)

Now, let's plot the scores and pick the regularization parameter $\lambda$ that gives the best validation score

In [None]:
plt.figure(figsize=(10,5))
plt.plot(list(np.arange(0.01, 5000.0, 100)), train_scores, label="train")
plt.plot(list(np.arange(0.01, 5000.0, 100)), val_scores, label="val")
plt.legend(loc='lower right')

plt.xlabel("Regularization Parameter(λ)")
plt.ylabel("Accuracy")
plt.grid()
plt.show()


- Validation accuracy rises to a peak and then falls

- As regularization keeps increasing, accuracy decreases because the model moves towards underfitting
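Reading the best $\lambda$ off the plot can also be done programmatically. The sketch below uses made-up validation scores (not the notebook's results) and simply picks the index with the highest value.

```python
# Hypothetical validation accuracies, one per candidate lambda (made-up numbers).
lambdas = [0.01, 100.01, 200.01, 300.01, 400.01]
val_scores_demo = [0.71, 0.72, 0.74, 0.73, 0.70]

# Index of the highest validation accuracy; ties resolve to the smallest lambda
# because max() returns the first maximal element.
best_idx = max(range(len(val_scores_demo)), key=val_scores_demo.__getitem__)
best_lambda = lambdas[best_idx]

print(f"best lambda = {best_lambda}, validation accuracy = {val_scores_demo[best_idx]}")
```

With real scores, neighbouring values of $\lambda$ often perform almost identically, so it is worth looking at the curve as well as the single best point.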

Let's take $\lambda = 1000$ for this data and check the results

In [None]:
model = LogisticRegression(C=1/1000)
model.fit(X_train, y_train)

In [None]:
accuracy(y_train, model.predict(X_train))

In [None]:
accuracy(y_val, model.predict(X_val))

We can observe an increase of about 0.01 (1 percentage point) in both training and validation accuracy

Let's check our model for test data too

In [None]:
accuracy(y_test, model.predict(X_test))

### Sklearn Code implementation for MultiClass Classification

Importing libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.inspection import DecisionBoundaryDisplay

Creating some data with multiple classes

In [None]:
# dataset creation with 3 classes
from sklearn.datasets import make_classification

X, y = make_classification(n_samples= 498,
                           n_features= 2,
                           n_classes = 3,
                           n_redundant=0,
                           n_clusters_per_class=1,
                           random_state=5)
y=y.reshape(len(y), 1)

print(X.shape, y.shape)

Plotting the data

In [None]:
plt.scatter(X[:, 0], X[:, 1], c = y)
plt.show()


Splitting the data into train validation and test set

In [None]:
from sklearn.model_selection import train_test_split

X_tr_cv, X_test, y_tr_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
X_train, X_val, y_train, y_val = train_test_split(X_tr_cv, y_tr_cv, test_size=0.25,random_state=4)
X_train.shape

Training the One-vs-Rest Logistic Regression model

In [None]:
model = LogisticRegression(multi_class='ovr')
# fit model (ravel() flattens the column-shaped y into the 1-D labels sklearn expects)
model.fit(X_train, y_train.ravel())

Checking the Accuracy of Training, validation and Test dataset

In [None]:
print(f'Training Accuracy:{model.score(X_train,y_train)}')
print(f'Validation Accuracy :{model.score(X_val,y_val)}')
print(f'Test Accuracy:{model.score(X_test,y_test)}')

Plotting the hyperplanes of the OvR LogisticRegression over the entire data

In [None]:
X

In [None]:
_, ax = plt.subplots()
DecisionBoundaryDisplay.from_estimator(model, X, response_method="predict", cmap=plt.cm.Paired, ax=ax)
plt.title("Decision surface of LogisticRegression")
plt.axis("tight")

# Plot also the data points
colors = "bry"
for i, color in zip(model.classes_, colors):
        # y is a column vector, so flatten it before comparing to get row indices
        idx = np.where(y.ravel() == i)[0]
        plt.scatter(
            X[idx, 0], X[idx, 1], c=color, edgecolor="black", s=20
        )


# Plot the three one-against-all classifiers
xmin, xmax = plt.xlim()
ymin, ymax = plt.ylim()
coef = model.coef_
intercept = model.intercept_

def plot_hyperplane(c, color):
        def line(x0):
            return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]

        plt.plot([xmin, xmax], [line(xmin), line(xmax)], ls="--", color=color)

for i, color in zip(model.classes_, colors):
        plot_hyperplane(i, color)

plt.show()

**Observe**

We can see how One-vs-Rest Logistic Regression handles multi-class data: it fits one binary classifier (one dashed hyperplane) per class
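The One-vs-Rest decision rule itself is short enough to sketch in plain Python: each class has its own binary classifier, and the predicted class is the one with the highest score. The weights and intercepts below are hypothetical values chosen for illustration, not the fitted `model.coef_` and `model.intercept_`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and intercepts for three one-vs-rest classifiers
# on two features (illustrative values, not the fitted model's).
coef_demo = [[2.0, -1.0], [-1.5, 0.5], [0.0, 1.5]]
intercept_demo = [0.1, -0.2, 0.0]

def ovr_predict(x):
    """Score x with each binary classifier and return (best class, scores)."""
    scores = [
        sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
        for w, b in zip(coef_demo, intercept_demo)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

label, scores = ovr_predict([1.0, 0.0])
print(label, [round(s, 3) for s in scores])
```

Since the sigmoid is monotonic, taking the argmax of the probabilities is the same as taking the argmax of the raw linear scores.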

Let's load the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
!gdown 1CgBW5H54YfdYtJmYE5GWctaHZSpFt71V

In [None]:
demo1 = pd.read_csv('spam_ham_dataset.csv')
demo1.drop(['Unnamed: 0','label'],axis=1,inplace=True)
demo1.head()

In [None]:
!gdown 1dw56R8SzKgTgiKurfBLUTxmiewJacMkt

In [None]:
dt = pd.read_csv('Spam_finalData.csv')

In [None]:
dt.head()

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(dt.drop(['label_num'],axis=1),dt['label_num'])

In [None]:
y_test.value_counts().plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.title('Test Data Distribution')
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression


model = LogisticRegression()
model.fit(X_train,y_train)


In [None]:
print('Model Accuracy:',model.score(X_test,y_test))

# **Confusion Matrix Code**

#### Let's use sklearn's `confusion_matrix` function to get the values

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix # 2D np array

The raw array is hard to read at a glance; sklearn's `ConfusionMatrixDisplay` makes it easy to plot

In [None]:
from matplotlib import pyplot as plt

In [None]:
# ax used here to control the size of confusion matrix
fig, ax = plt.subplots(figsize=(5,5))
ConfusionMatrixDisplay(conf_matrix).plot(ax = ax)

Finding Accuracy using Confusion Matrix

In [None]:
np.diag(conf_matrix).sum() / conf_matrix.sum()
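To see where the four cells come from, here is a minimal plain-Python sketch that builds a binary confusion matrix in the same layout sklearn uses (rows are actual classes, columns are predicted classes) and derives accuracy from it. The labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # rows = actual, columns = predicted
    return matrix

# Hypothetical labels for illustration only.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_counts(y_true_demo, y_pred_demo)
# Accuracy = correct predictions (the diagonal) divided by all predictions.
accuracy_from_cm = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(cm, accuracy_from_cm)
```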

# **Precision Code**

Scratch Implementation

In [None]:
def precision_calc(conf):
  tp = conf[1,1]
  fp = conf[0,1]

  return tp/(tp+fp)

In [None]:
precision_calc(conf_matrix)

Using Sklearn's precision Score

In [None]:
from sklearn.metrics import precision_score

precision_score(y_test, y_pred)

**observe**

Even though the model's precision is lower than its accuracy:
- It's still a strong model because its precision remains high

# **Recall Code**

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
ConfusionMatrixDisplay(conf_matrix).plot(ax = ax)

Scratch Implementation

In [None]:
def recall_calc(conf):
  tp = conf[1,1]
  fn = conf[1,0]

  return tp/(tp+fn)

In [None]:
recall_calc(conf_matrix)

Using Sklearn's recall score

In [None]:
from sklearn.metrics import recall_score

recall_score(y_test, y_pred)

**observe**

The model's recall is very close to its accuracy:
- This shows the model has very few false negatives (FN)

### F1-Score

In [None]:
ConfusionMatrixDisplay(conf_matrix).plot()

scratch implementation

In [None]:
pre = precision_calc(conf_matrix)
re = recall_calc(conf_matrix)

f1 = 2* (pre*re)/(pre+re+1e-6)

print(f'f1Score:{f1}')

In [None]:
from sklearn.metrics import f1_score

In [None]:
print(f'f1Score:{f1_score(y_test,y_pred)}')

**observe**

Our model looks quite good:
- Even though the data is imbalanced,
- the model's F1 score is high.

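The small difference between the scratch implementation and sklearn's F1 score:
- The scratch version adds `1e-6` to the denominator to guard against division by zero, while sklearn adds no such term and handles that case through its `zero_division` parameter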
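As a worked example of how the three metrics relate, the sketch below starts from made-up counts (not the notebook's results) and computes F1 two ways: as the harmonic mean of precision and recall, and directly from the counts.

```python
# Hypothetical confusion-matrix counts (not the notebook's results).
tp, fp, fn = 30, 10, 20

precision_demo = tp / (tp + fp)   # 30 / 40
recall_demo = tp / (tp + fn)      # 30 / 50

# F1 as the harmonic mean of precision and recall ...
f1_harmonic = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
# ... and the equivalent form written directly in terms of counts.
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(precision_demo, recall_demo, f1_harmonic, f1_counts)
```

Neither form involves true negatives, which is why F1 is less flattered by a large majority class than accuracy is.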

# Spam vs Non-Spam: Business Case



You are working at Google and have been tasked with building an email spam detection model

Here,
- **not spam** → Class 0
- **spam** → Class 1

<br>




**Note:** For simplicity, let's call:
- Class 0 (**Not Spam**) the Negative Class
- and Class 1 (**Spam**) the Positive Class



Let's load the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

!gdown 1dw56R8SzKgTgiKurfBLUTxmiewJacMkt

dt = pd.read_csv('Spam_finalData.csv')




X_train,X_test,y_train,y_test = train_test_split(dt.drop(['label_num'],axis=1),dt['label_num'])

y_test.value_counts().plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.title('Test Data Distribution')
plt.show()


print(f'Training Data:{X_train.shape},{y_train.shape}, Testing Data: {X_test.shape},{y_test.shape}')





model = LogisticRegression()
model.fit(X_train,y_train)



# **AU-ROC curve Code**

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score


Store the model's predicted probabilities

In [None]:
probability = model.predict_proba(X_test)

In [None]:
probability

**Observe**

The `probability` array has two columns: $P(Y=0 |X)$ in column 0 and $P(Y=1 |X)$ in column 1

#### But for thresholding we need only one probability. What can be done?

Ans: Let's consider only $p = P(Y=1 |X)$
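To illustrate what thresholding on $p$ means, here is a small plain-Python sketch with hypothetical labels and probabilities. Each threshold turns the probabilities into hard predictions and yields one (FPR, TPR) point; sweeping the threshold traces out the ROC curve.

```python
def rates_at_threshold(y_true, p_pos, threshold):
    """Predict 1 when P(Y=1|X) >= threshold and return (FPR, TPR)."""
    preds = [1 if p >= threshold else 0 for p in p_pos]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical labels and positive-class probabilities.
y_thr_demo = [0, 0, 1, 1, 0, 1]
p_thr_demo = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

for t in (0.3, 0.5, 0.7):
    print(t, rates_at_threshold(y_thr_demo, p_thr_demo, t))
```

Lowering the threshold catches more positives (higher TPR) at the cost of more false alarms (higher FPR).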




In [None]:
probabilites = probability[:,1]

In [None]:
fpr, tpr, thr = roc_curve(y_test,probabilites)

In [None]:
plt.plot(fpr,tpr)

#random model
plt.plot(fpr,fpr,'--',color='red' )
plt.title('ROC curve')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

In [None]:
# AUC
roc_auc_score(y_test,probabilites)
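One way to interpret the AUC value: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes that fraction directly on hypothetical data; `auc_by_ranking` is an illustrative helper, not an sklearn function.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and positive-class probabilities.
y_auc_demo = [0, 0, 1, 1, 0, 1]
p_auc_demo = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

print(auc_by_ranking(y_auc_demo, p_auc_demo))
```

A model that ranks at random lands near 0.5, which corresponds to the dashed diagonal in the ROC plot.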

# **Precision Recall curve**

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc

In [None]:
precision, recall, thr = precision_recall_curve(y_test, probabilites)

In [None]:
plt.plot(recall, precision)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')
plt.show()

In [None]:
auc(recall, precision)

**observe**

The **AU-PRC** comes close to the F1 score
- This suggests the **PRC** remains informative on imbalanced data




## **Class weight Code**


Let's now see how it's implemented in Sklearn for Logistic Regression:




In [None]:
y_train.value_counts().plot(kind='bar')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.title('Train Data Distribution')
plt.show()

**observe**

The training data:
- Non-spam data = 2727
- Spam data = 1151

Hence the weight for the minority (spam) class becomes:
- $W_i = \frac{2727}{1151} \approx 2.37$
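The arithmetic above, and what the weight does to the loss, can be sketched in plain Python. The class counts are the ones quoted above; the probabilities passed to `weighted_log_loss` (an illustrative helper, not an sklearn function) are made up. For comparison, the sketch also applies the formula sklearn documents for `class_weight="balanced"`, which gives the same ratio between the two classes.

```python
import math

# Class counts quoted above for the training split.
n_neg, n_pos = 2727, 1151

# Ratio used in the text: how much more a spam example should count.
w_pos = n_neg / n_pos

# For comparison, the formula documented for class_weight="balanced":
# n_samples / (n_classes * count_of_class).
n_samples = n_neg + n_pos
balanced = {0: n_samples / (2 * n_neg), 1: n_samples / (2 * n_pos)}

def weighted_log_loss(y, p, weights):
    """Per-example log loss scaled by the weight of the true class."""
    return -weights[y] * math.log(p if y == 1 else 1.0 - p)

weights = {0: 1.0, 1: w_pos}
print(round(w_pos, 2), balanced)
# Equally confident mistakes cost more when the true class is spam.
print(weighted_log_loss(1, 0.3, weights), weighted_log_loss(0, 0.7, weights))
```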

In [None]:
# Fit the model, predict, and return the F1 score on the train and test sets
from sklearn.metrics import f1_score

def training(model,X_train,y_train,X_test,y_test):

  model.fit(X_train, y_train)

  train_y_pred = model.predict(X_train)
  test_y_pred = model.predict(X_test)

  train_score = f1_score(y_train, train_y_pred)
  test_score = f1_score(y_test, test_y_pred)

  return train_score,test_score


In [None]:
# minority class needs more re-weighting


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

model = LogisticRegression(class_weight={0:1,1:2.37})

f1_train,f1_test = training(model,X_train,y_train,X_test,y_test)
print(f'Training F1 score:{f1_train}, Testing F1 score:{f1_test}')

**Observe**

Introducing the weighted loss
- did not change the F1 score much

<br>


#### What can be the reason?
Ans: Let's check the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)

ConfusionMatrixDisplay(conf_matrix).plot()

**Observe**

Clearly, by introducing class weights,
- the model has predicted many Non-Spam emails as Spam ($FP \uparrow$)
- This lowers precision, which holds the F1 score down

# **Oversampling code**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

# Create an instance of RandomOverSampler
oversampler = RandomOverSampler()

# Perform oversampling on the training data
print('Before Oversampling')
print(y_train.value_counts())
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)

print('After Oversampling')
print(y_train_oversampled.value_counts())

model = LogisticRegression()

f1_train,f1_test = training(model,X_train_oversampled, y_train_oversampled,X_test,y_test)

print(f'Training F1 score:{f1_train}, Testing F1 score:{f1_test}')

**Observe**

The training F1 score is much higher than the testing F1 score

<br>

#### What can be said when training performance > testing performance?

Ans: The model overfits
- This means that adding repeated copies of minority-class samples **can lead to overfitting**




#### Why does the model overfit with the oversampling technique?

Ans: Because oversampling just **repeats samples**
- This makes the model over-learn the patterns in those repeated samples
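The point about repeated samples can be seen in a few lines of plain Python. This is a simplified sketch of random oversampling on a hypothetical toy dataset, not the `imblearn` implementation: the class counts become balanced, but the set of distinct minority rows does not grow.

```python
import random
from collections import Counter

rng = random.Random(0)  # seeded so the sketch is reproducible

# Hypothetical imbalanced dataset: (feature, label) pairs.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.9, 1), (1.1, 1)]

majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Draw minority rows with replacement until both classes have the same size.
extra = rng.choices(minority, k=len(majority) - len(minority))
oversampled = majority + minority + extra

print(Counter(label for _, label in oversampled))
print(len(set(minority + extra)), "distinct minority rows")
```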

<br>

#### What can be a smarter approach for oversampling?
Ans: Instead of repeating the samples:
- Let's create **new synthetic samples** for our minority class label

- This gives the model new samples, so it does not over-learn any particular pattern
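The core idea behind synthetic samples can be sketched as interpolation between a minority point and one of its minority-class neighbours. The sketch below is a simplification under stated assumptions: the two points are hypothetical and the neighbour is given directly, whereas SMOTE finds it through a nearest-neighbour search. `smote_like_sample` is an illustrative helper, not an `imblearn` function.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and one of its minority neighbors."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)  # seeded so the sketch is reproducible

# Two hypothetical minority-class points in a 2-feature space.
x = [1.0, 2.0]
neighbor = [3.0, 6.0]

synthetic = [smote_like_sample(x, neighbor, rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Each synthetic point is new to the model, yet it stays inside the region already occupied by the minority class.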

<br>




# **SMOTE (Synthetic Minority Over-sampling Technique)**


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

# Create an instance of SMOTE
smt = SMOTE()


# Perform SMOTE on the training data
print('Before SMOTE')
print(y_train.value_counts())

X_sm, y_sm = smt.fit_resample(X_train, y_train)
print('After SMOTE')
print(y_sm.value_counts())

model = LogisticRegression(C= 5, penalty= 'l1', solver = 'liblinear')

f1_train,f1_test = training(model,X_sm, y_sm,X_test,y_test)

print(f'Training F1 score:{f1_train}, Testing F1 score:{f1_test}')

