# Context :This dataset is part of a data science project focused on customer churn prediction for a subscription-based service. Customer churn, the rate at which customers cancel their subscriptions, is a vital metric for businesses offering subscription services. Predictive analytics techniques are employed to anticipate which customers are likely to churn, enabling companies to take proactive measures for customer retention.

# Content: This dataset contains anonymized information about customer subscriptions and their interaction with the service. The data includes various features such as subscription type, payment method, viewing preferences, customer support interactions, and other relevant attributes. It consists of three files such as "test.csv", "train.csv", "data_descriptions.csv".

In [1]:
#Importing necessary libaries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
train_data=pd.read_csv(r'E:\ML datasets\train.csv')
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243787 entries, 0 to 243786
Data columns (total 21 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   AccountAge                243787 non-null  int64  
 1   MonthlyCharges            243787 non-null  float64
 2   TotalCharges              243787 non-null  float64
 3   SubscriptionType          243787 non-null  object 
 4   PaymentMethod             243787 non-null  object 
 5   PaperlessBilling          243787 non-null  object 
 6   ContentType               243787 non-null  object 
 7   MultiDeviceAccess         243787 non-null  object 
 8   DeviceRegistered          243787 non-null  object 
 9   ViewingHoursPerWeek       243787 non-null  float64
 10  AverageViewingDuration    243787 non-null  float64
 11  ContentDownloadsPerMonth  243787 non-null  int64  
 12  GenrePreference           243787 non-null  object 
 13  UserRating                243787 non-null  f

In [3]:
test_data=pd.read_csv(r'E:\ML datasets\test.csv')
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104480 entries, 0 to 104479
Data columns (total 20 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   AccountAge                104480 non-null  int64  
 1   MonthlyCharges            104480 non-null  float64
 2   TotalCharges              104480 non-null  float64
 3   SubscriptionType          104480 non-null  object 
 4   PaymentMethod             104480 non-null  object 
 5   PaperlessBilling          104480 non-null  object 
 6   ContentType               104480 non-null  object 
 7   MultiDeviceAccess         104480 non-null  object 
 8   DeviceRegistered          104480 non-null  object 
 9   ViewingHoursPerWeek       104480 non-null  float64
 10  AverageViewingDuration    104480 non-null  float64
 11  ContentDownloadsPerMonth  104480 non-null  int64  
 12  GenrePreference           104480 non-null  object 
 13  UserRating                104480 non-null  f

In [73]:
#Shape
train_data.shape

(243787, 21)

In [74]:
#Columns
train_data.columns

Index(['AccountAge', 'MonthlyCharges', 'TotalCharges', 'SubscriptionType',
       'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess',
       'DeviceRegistered', 'ViewingHoursPerWeek', 'AverageViewingDuration',
       'ContentDownloadsPerMonth', 'GenrePreference', 'UserRating',
       'SupportTicketsPerMonth', 'Gender', 'WatchlistSize', 'ParentalControl',
       'SubtitlesEnabled', 'CustomerID', 'Churn'],
      dtype='object')

In [75]:
#Head
train_data.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.39516,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.61617,4,Female,0,No,No,4LGYPK7VOL,0


In [76]:
#Test
test_data.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID
0,38,17.869374,679.036195,Premium,Mailed check,No,TV Shows,No,TV,29.126308,122.274031,42,Comedy,3.522724,2,Male,23,No,No,O1W6BHP6RM
1,77,9.912854,763.289768,Basic,Electronic check,Yes,TV Shows,No,TV,36.873729,57.093319,43,Action,2.021545,2,Female,22,Yes,No,LFR4X92X8H
2,5,15.019011,75.095057,Standard,Bank transfer,No,TV Shows,Yes,Computer,7.601729,140.414001,14,Sci-Fi,4.806126,2,Female,22,No,Yes,QM5GBIYODA
3,88,15.357406,1351.451692,Standard,Electronic check,No,Both,Yes,Tablet,35.58643,177.002419,14,Comedy,4.9439,0,Female,23,Yes,Yes,D9RXTK2K9F
4,91,12.406033,1128.949004,Standard,Credit card,Yes,TV Shows,Yes,Tablet,23.503651,70.308376,6,Drama,2.84688,6,Female,0,No,No,ENTCCHR1LR


In [77]:
#Removing unnecessary features
train_data.drop(columns={"CustomerID"},inplace=True)
test_data.drop(columns={"CustomerID"},inplace=True)

In [78]:
#Finding Categorical  Columns using for loop and if statement where data types == "object" and then using print The expression {len(cat_col)} is a placeholder that will be replaced by the value of the len function applied to the cat_col variable. The len function returns the number of elements in a list, and the cat_col variable is a list of the names of the categorical columns in the dataset. Therefore, this expression will show the number of categorical columns in the dataset.
#The f before the string indicates that it is a formatted string that can contain placeholders and expressions inside curly braces. The f-string format is a convenient way to create strings that include variables and calculations.
cat_col=[]
for col  in test_data.columns:
    if(test_data[col].dtypes=='object'):
        cat_col.append(col)
print(f"There are total {len(cat_col)} categorical columns in datset")
print(cat_col)

There are total 10 categorical columns in datset
['SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess', 'DeviceRegistered', 'GenrePreference', 'Gender', 'ParentalControl', 'SubtitlesEnabled']


In [79]:
#Finding Numerical Columns
num_col=[]
for col in test_data.columns:
    if(test_data[col].dtypes!='object'):
        num_col.append(col)
print(f"There are total {len(num_col)} numerical columns in dataset")
print(num_col)

There are total 9 numerical columns in dataset
['AccountAge', 'MonthlyCharges', 'TotalCharges', 'ViewingHoursPerWeek', 'AverageViewingDuration', 'ContentDownloadsPerMonth', 'UserRating', 'SupportTicketsPerMonth', 'WatchlistSize']


# data cleaning and preprocessing

In [80]:
#sns.pairplot(train_data,hue='Churn')

In [81]:
#Finding Null Values in Training Data
train_data.isnull().mean()*100

AccountAge                  0.0
MonthlyCharges              0.0
TotalCharges                0.0
SubscriptionType            0.0
PaymentMethod               0.0
PaperlessBilling            0.0
ContentType                 0.0
MultiDeviceAccess           0.0
DeviceRegistered            0.0
ViewingHoursPerWeek         0.0
AverageViewingDuration      0.0
ContentDownloadsPerMonth    0.0
GenrePreference             0.0
UserRating                  0.0
SupportTicketsPerMonth      0.0
Gender                      0.0
WatchlistSize               0.0
ParentalControl             0.0
SubtitlesEnabled            0.0
Churn                       0.0
dtype: float64

In [82]:
#Finding NUll Values in Testing Data
test_data.isnull().mean()*100

AccountAge                  0.0
MonthlyCharges              0.0
TotalCharges                0.0
SubscriptionType            0.0
PaymentMethod               0.0
PaperlessBilling            0.0
ContentType                 0.0
MultiDeviceAccess           0.0
DeviceRegistered            0.0
ViewingHoursPerWeek         0.0
AverageViewingDuration      0.0
ContentDownloadsPerMonth    0.0
GenrePreference             0.0
UserRating                  0.0
SupportTicketsPerMonth      0.0
Gender                      0.0
WatchlistSize               0.0
ParentalControl             0.0
SubtitlesEnabled            0.0
dtype: float64

In [83]:

#Finding Duplicates in Train Data
train_data.duplicated().sum()

0

In [84]:
#Finding Duplicates in Test Data
test_data.duplicated().sum()

0

In [85]:
#Feature Encoding this sample method is used that randomly picks three rows form the dataset and shows their values for these columns. 
train_data[cat_col].sample(3)

Unnamed: 0,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,GenrePreference,Gender,ParentalControl,SubtitlesEnabled
213848,Premium,Electronic check,No,Both,Yes,Tablet,Fantasy,Male,Yes,Yes
88309,Standard,Mailed check,No,Both,No,TV,Comedy,Female,No,Yes
187787,Standard,Electronic check,No,Both,No,TV,Fantasy,Male,No,No


In [86]:
#Importing Required Libraries
from sklearn.preprocessing import LabelEncoder
for col in cat_col:
    print()
    print(f"{col}_encoder=LabelEncoder()")
    print(f"train_data['{col}']={col}_encoder.fit_transform(train_data['{col}'])")


SubscriptionType_encoder=LabelEncoder()
train_data['SubscriptionType']=SubscriptionType_encoder.fit_transform(train_data['SubscriptionType'])

PaymentMethod_encoder=LabelEncoder()
train_data['PaymentMethod']=PaymentMethod_encoder.fit_transform(train_data['PaymentMethod'])

PaperlessBilling_encoder=LabelEncoder()
train_data['PaperlessBilling']=PaperlessBilling_encoder.fit_transform(train_data['PaperlessBilling'])

ContentType_encoder=LabelEncoder()
train_data['ContentType']=ContentType_encoder.fit_transform(train_data['ContentType'])

MultiDeviceAccess_encoder=LabelEncoder()
train_data['MultiDeviceAccess']=MultiDeviceAccess_encoder.fit_transform(train_data['MultiDeviceAccess'])

DeviceRegistered_encoder=LabelEncoder()
train_data['DeviceRegistered']=DeviceRegistered_encoder.fit_transform(train_data['DeviceRegistered'])

GenrePreference_encoder=LabelEncoder()
train_data['GenrePreference']=GenrePreference_encoder.fit_transform(train_data['GenrePreference'])

Gender_encoder=LabelEncoder()


In [87]:
#Encoding on Training data
SubscriptionType_encoder=LabelEncoder()
train_data['SubscriptionType']=SubscriptionType_encoder.fit_transform(train_data['SubscriptionType'])

PaymentMethod_encoder=LabelEncoder()
train_data['PaymentMethod']=PaymentMethod_encoder.fit_transform(train_data['PaymentMethod'])

PaperlessBilling_encoder=LabelEncoder()
train_data['PaperlessBilling']=PaperlessBilling_encoder.fit_transform(train_data['PaperlessBilling'])

ContentType_encoder=LabelEncoder()
train_data['ContentType']=ContentType_encoder.fit_transform(train_data['ContentType'])
MultiDeviceAccess_encoder=LabelEncoder()
train_data['MultiDeviceAccess']=MultiDeviceAccess_encoder.fit_transform(train_data['MultiDeviceAccess'])

DeviceRegistered_encoder=LabelEncoder()
train_data['DeviceRegistered']=DeviceRegistered_encoder.fit_transform(train_data['DeviceRegistered'])

GenrePreference_encoder=LabelEncoder()
train_data['GenrePreference']=GenrePreference_encoder.fit_transform(train_data['GenrePreference'])

Gender_encoder=LabelEncoder()
train_data['Gender']=Gender_encoder.fit_transform(train_data['Gender'])

ParentalControl_encoder=LabelEncoder()
train_data['ParentalControl']=ParentalControl_encoder.fit_transform(train_data['ParentalControl'])
SubtitlesEnabled_encoder=LabelEncoder()
train_data['SubtitlesEnabled']=SubtitlesEnabled_encoder.fit_transform(train_data['SubtitlesEnabled'])

In [88]:
#After Encoding
train_data[cat_col].sample(5)

Unnamed: 0,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,GenrePreference,Gender,ParentalControl,SubtitlesEnabled
87132,1,3,1,0,1,3,4,1,1,0
202891,0,1,1,2,1,3,3,1,1,1
172350,0,1,1,1,0,2,4,0,1,1
8554,2,0,0,2,0,2,1,1,0,0
230209,0,1,1,2,1,0,3,0,1,0


In [89]:
#Encoding Categorical Columns of Testing Data ,In the last line prints a line of code that uses the transform method of the encoder object that corresponds to the column name to encode the values of the column in the test dataset. The encoder object is named as col_encoder, where col is the column name The transform method of the encoder object takes the original values of the column and converts them to numerical values based on the mapping that was learned from the train dataset.  
for col in cat_col:
    print()
    print(f"#{col}")
    print(f"test_data['{col}']={col}_encoder.transform(train_data['{col}'])")


#SubscriptionType
test_data['SubscriptionType']=SubscriptionType_encoder.transform(train_data['SubscriptionType'])

#PaymentMethod
test_data['PaymentMethod']=PaymentMethod_encoder.transform(train_data['PaymentMethod'])

#PaperlessBilling
test_data['PaperlessBilling']=PaperlessBilling_encoder.transform(train_data['PaperlessBilling'])

#ContentType
test_data['ContentType']=ContentType_encoder.transform(train_data['ContentType'])

#MultiDeviceAccess
test_data['MultiDeviceAccess']=MultiDeviceAccess_encoder.transform(train_data['MultiDeviceAccess'])

#DeviceRegistered
test_data['DeviceRegistered']=DeviceRegistered_encoder.transform(train_data['DeviceRegistered'])

#GenrePreference
test_data['GenrePreference']=GenrePreference_encoder.transform(train_data['GenrePreference'])

#Gender
test_data['Gender']=Gender_encoder.transform(train_data['Gender'])

#ParentalControl
test_data['ParentalControl']=ParentalControl_encoder.transform(train_data['ParentalControl'])

#SubtitlesEnabled
test_data['Su

# feature encoding on test data

In [90]:
#SubscriptionType
test_data['SubscriptionType']=SubscriptionType_encoder.transform(test_data['SubscriptionType'])

#PaymentMethod
test_data['PaymentMethod']=PaymentMethod_encoder.transform(test_data['PaymentMethod'])

#PaperlessBilling
test_data['PaperlessBilling']=PaperlessBilling_encoder.transform(test_data['PaperlessBilling'])
#ContentType
test_data['ContentType']=ContentType_encoder.transform(test_data['ContentType'])

#MultiDeviceAccess
test_data['MultiDeviceAccess']=MultiDeviceAccess_encoder.transform(test_data['MultiDeviceAccess'])

#DeviceRegistered
test_data['DeviceRegistered']=DeviceRegistered_encoder.transform(test_data['DeviceRegistered'])

#GenrePreference
test_data['GenrePreference']=GenrePreference_encoder.transform(test_data['GenrePreference'])
#Gender
test_data['Gender']=Gender_encoder.transform(test_data['Gender'])

#ParentalControl
test_data['ParentalControl']=ParentalControl_encoder.transform(test_data['ParentalControl'])

#SubtitlesEnabled
test_data['SubtitlesEnabled']=SubtitlesEnabled_encoder.transform(test_data['SubtitlesEnabled'])

In [91]:
#After Encoding
test_data[cat_col].sample(3)

Unnamed: 0,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,GenrePreference,Gender,ParentalControl,SubtitlesEnabled
10592,1,1,1,2,1,3,0,0,1,1
16511,2,2,0,2,1,1,4,1,0,1
64878,1,3,1,0,0,3,3,0,1,0


In [92]:
#Feature Scaling 
from sklearn.preprocessing import StandardScaler
for col in num_col:
    print()
    print(f"#{col}")
    print(f"{col}_scaler=StandardScaler()")
    print(f"train_data['{col}']={col}_scaler.fit_transform(np.array(train_data['{col}']).reshape(len(train_data['{col}']),1))")


#AccountAge
AccountAge_scaler=StandardScaler()
train_data['AccountAge']=AccountAge_scaler.fit_transform(np.array(train_data['AccountAge']).reshape(len(train_data['AccountAge']),1))

#MonthlyCharges
MonthlyCharges_scaler=StandardScaler()
train_data['MonthlyCharges']=MonthlyCharges_scaler.fit_transform(np.array(train_data['MonthlyCharges']).reshape(len(train_data['MonthlyCharges']),1))

#TotalCharges
TotalCharges_scaler=StandardScaler()
train_data['TotalCharges']=TotalCharges_scaler.fit_transform(np.array(train_data['TotalCharges']).reshape(len(train_data['TotalCharges']),1))

#ViewingHoursPerWeek
ViewingHoursPerWeek_scaler=StandardScaler()
train_data['ViewingHoursPerWeek']=ViewingHoursPerWeek_scaler.fit_transform(np.array(train_data['ViewingHoursPerWeek']).reshape(len(train_data['ViewingHoursPerWeek']),1))

#AverageViewingDuration
AverageViewingDuration_scaler=StandardScaler()
train_data['AverageViewingDuration']=AverageViewingDuration_scaler.fit_transform(np.array(train_data['AverageV

In [93]:
#Before Scaling Training Data
train_data[num_col].sample(3)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,WatchlistSize
135118,11,14.412238,158.534622,11.147661,168.611255,10,2.336242,2,15
39539,101,5.738883,579.627174,36.267877,57.400333,3,3.719808,1,5
56790,18,13.801925,248.434651,7.672489,170.939005,25,2.481713,2,13


# It creates an instance of the StandardScaler class and assigns it to the AccountAge_scaler variable. The StandardScaler class is a tool that can transform the data such that it has a mean of zero and a standard deviation of one. This can help to improve the performance and accuracy of some machine learning algorithms that assume that the data is normally distributed.
#It selects for my all desirecolumn concept is same from the train_data variable, which is a pandas DataFrame object that contains the training data. all columns contains the values of the account age feature, which is a numerical variable that indicates how long the account has been active in days.
It converts the all columns to a numpy array using the np.array function. A numpy array is a data structure that can store multiple values of the same data type in a fixed-size grid. Numpy arrays are faster and more efficient than lists for numerical computations.
It reshapes the numpy array to have a two-dimensional shape using the reshape method. The reshape method takes two arguments: the number of rows and the number of columns. The code passes the length of the all desired columns as the number of rows and 1 as the number of columns. This means that the numpy array will have one column and as many rows as the AccountAge column. This is necessary because the StandardScaler class expects a two-dimensional array as input , It standardizes the numpy array using the fit_transform method for the all object columns. The fit_transform method does two things: it calculates the mean and standard deviation of the numpy array using the fit method, and it transforms the numpy array to have a mean of zero and a standard deviation of one using the transform method. The transformed numpy array will have the same shape as the original numpy array, but the values will be scaled according to the formula:
z = (x - mean) / std

where z is the standardized value, x is the original value, mean is the mean of the numpy array, and std is the standard deviation of the numpy array.

It assigns the standardized numpy array back to the all column in the train_data variable. This means that the original values of the all column will be replaced by the standardized values.

In [94]:
#AccountAge
AccountAge_scaler=StandardScaler()
train_data['AccountAge']=AccountAge_scaler.fit_transform(np.array(train_data['AccountAge']).reshape(len(train_data['AccountAge']),1))

#MonthlyCharges
MonthlyCharges_scaler=StandardScaler()
train_data['MonthlyCharges']=MonthlyCharges_scaler.fit_transform(np.array(train_data['MonthlyCharges']).reshape(len(train_data['MonthlyCharges']),1))

#TotalCharges
TotalCharges_scaler=StandardScaler()
train_data['TotalCharges']=TotalCharges_scaler.fit_transform(np.array(train_data['TotalCharges']).reshape(len(train_data['TotalCharges']),1))
#ViewingHoursPerWeek
ViewingHoursPerWeek_scaler=StandardScaler()
train_data['ViewingHoursPerWeek']=ViewingHoursPerWeek_scaler.fit_transform(np.array(train_data['ViewingHoursPerWeek']).reshape(len(train_data['ViewingHoursPerWeek']),1))

#AverageViewingDuration
AverageViewingDuration_scaler=StandardScaler()
train_data['AverageViewingDuration']=AverageViewingDuration_scaler.fit_transform(np.array(train_data['AverageViewingDuration']).reshape(len(train_data['AverageViewingDuration']),1))
#ContentDownloadsPerMonth
ContentDownloadsPerMonth_scaler=StandardScaler()
train_data['ContentDownloadsPerMonth']=ContentDownloadsPerMonth_scaler.fit_transform(np.array(train_data['ContentDownloadsPerMonth']).reshape(len(train_data['ContentDownloadsPerMonth']),1))

#UserRating
UserRating_scaler=StandardScaler()
train_data['UserRating']=UserRating_scaler.fit_transform(np.array(train_data['UserRating']).reshape(len(train_data['UserRating']),1))

#SupportTicketsPerMonth
SupportTicketsPerMonth_scaler=StandardScaler()
train_data['SupportTicketsPerMonth']=SupportTicketsPerMonth_scaler.fit_transform(np.array(train_data['SupportTicketsPerMonth']).reshape(len(train_data['SupportTicketsPerMonth']),1))
#WatchlistSize
WatchlistSize_scaler=StandardScaler()
train_data['WatchlistSize']=WatchlistSize_scaler.fit_transform(np.array(train_data['WatchlistSize']).reshape(len(train_data['WatchlistSize']),1))

In [95]:
#AccountAge
AccountAge_scaler=StandardScaler()
train_data['AccountAge']=AccountAge_scaler.fit_transform(np.array(train_data['AccountAge']).reshape(len(train_data['AccountAge']),1))

#MonthlyCharges
MonthlyCharges_scaler=StandardScaler()
train_data['MonthlyCharges']=MonthlyCharges_scaler.fit_transform(np.array(train_data['MonthlyCharges']).reshape(len(train_data['MonthlyCharges']),1))

#TotalCharges
TotalCharges_scaler=StandardScaler()
train_data['TotalCharges']=TotalCharges_scaler.fit_transform(np.array(train_data['TotalCharges']).reshape(len(train_data['TotalCharges']),1))

#ViewingHoursPerWeek
ViewingHoursPerWeek_scaler=StandardScaler()
train_data['ViewingHoursPerWeek']=ViewingHoursPerWeek_scaler.fit_transform(np.array(train_data['ViewingHoursPerWeek']).reshape(len(train_data['ViewingHoursPerWeek']),1))

#AverageViewingDuration
AverageViewingDuration_scaler=StandardScaler()
train_data['AverageViewingDuration']=AverageViewingDuration_scaler.fit_transform(np.array(train_data['AverageViewingDuration']).reshape(len(train_data['AverageViewingDuration']),1))

#ContentDownloadsPerMonth
ContentDownloadsPerMonth_scaler=StandardScaler()
train_data['ContentDownloadsPerMonth']=ContentDownloadsPerMonth_scaler.fit_transform(np.array(train_data['ContentDownloadsPerMonth']).reshape(len(train_data['ContentDownloadsPerMonth']),1))

#UserRating
UserRating_scaler=StandardScaler()
train_data['UserRating']=UserRating_scaler.fit_transform(np.array(train_data['UserRating']).reshape(len(train_data['UserRating']),1))

#SupportTicketsPerMonth
SupportTicketsPerMonth_scaler=StandardScaler()
train_data['SupportTicketsPerMonth']=SupportTicketsPerMonth_scaler.fit_transform(np.array(train_data['SupportTicketsPerMonth']).reshape(len(train_data['SupportTicketsPerMonth']),1))

#WatchlistSize
WatchlistSize_scaler=StandardScaler()
train_data['WatchlistSize']=WatchlistSize_scaler.fit_transform(np.array(train_data['WatchlistSize']).reshape(len(train_data['WatchlistSize']),1))

In [96]:
#After Scaling Training Data
train_data[num_col].sample(3)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,WatchlistSize
130339,-1.023295,-1.054512,-1.056377,0.412288,0.574756,-1.144397,1.660889,-1.21989,0.553522
208693,0.055891,-0.830551,-0.380759,0.600944,0.634545,0.10377,1.235283,-0.871766,1.387664
92263,-1.460804,-1.35599,-1.308645,-0.699346,-0.826827,1.559965,-0.367871,0.868852,1.665711


In [97]:
#Before Scaling Testing data
test_data[num_col].sample(3)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,WatchlistSize
20954,92,17.079789,1571.340563,39.169309,62.690003,45,3.077225,6,0
18952,109,16.85042,1836.695833,19.022577,78.595024,24,4.381106,6,7
37767,3,14.899692,44.699076,34.769677,160.084006,4,1.283751,1,23


In [98]:
#Feature Scaling 
from sklearn.preprocessing import StandardScaler
for col in num_col:
    print()
    print(f"#{col}")
    print(f"test_data['{col}']={col}_scaler.transform(np.array(test_data['{col}']).reshape(len(test_data['{col}']),1))")


#AccountAge
test_data['AccountAge']=AccountAge_scaler.transform(np.array(test_data['AccountAge']).reshape(len(test_data['AccountAge']),1))

#MonthlyCharges
test_data['MonthlyCharges']=MonthlyCharges_scaler.transform(np.array(test_data['MonthlyCharges']).reshape(len(test_data['MonthlyCharges']),1))

#TotalCharges
test_data['TotalCharges']=TotalCharges_scaler.transform(np.array(test_data['TotalCharges']).reshape(len(test_data['TotalCharges']),1))

#ViewingHoursPerWeek
test_data['ViewingHoursPerWeek']=ViewingHoursPerWeek_scaler.transform(np.array(test_data['ViewingHoursPerWeek']).reshape(len(test_data['ViewingHoursPerWeek']),1))

#AverageViewingDuration
test_data['AverageViewingDuration']=AverageViewingDuration_scaler.transform(np.array(test_data['AverageViewingDuration']).reshape(len(test_data['AverageViewingDuration']),1))

#ContentDownloadsPerMonth
test_data['ContentDownloadsPerMonth']=ContentDownloadsPerMonth_scaler.transform(np.array(test_data['ContentDownloadsPerMonth']).reshape(le

In [99]:
#AccountAge
test_data['AccountAge']=AccountAge_scaler.transform(np.array(test_data['AccountAge']).reshape(len(test_data['AccountAge']),1))

#MonthlyCharges
test_data['MonthlyCharges']=MonthlyCharges_scaler.transform(np.array(test_data['MonthlyCharges']).reshape(len(test_data['MonthlyCharges']),1))

#TotalCharges
test_data['TotalCharges']=TotalCharges_scaler.transform(np.array(test_data['TotalCharges']).reshape(len(test_data['TotalCharges']),1))

#ViewingHoursPerWeek
test_data['ViewingHoursPerWeek']=ViewingHoursPerWeek_scaler.transform(np.array(test_data['ViewingHoursPerWeek']).reshape(len(test_data['ViewingHoursPerWeek']),1))
#AverageViewingDuration
test_data['AverageViewingDuration']=AverageViewingDuration_scaler.transform(np.array(test_data['AverageViewingDuration']).reshape(len(test_data['AverageViewingDuration']),1))

#ContentDownloadsPerMonth
test_data['ContentDownloadsPerMonth']=ContentDownloadsPerMonth_scaler.transform(np.array(test_data['ContentDownloadsPerMonth']).reshape(len(test_data['ContentDownloadsPerMonth']),1))

#UserRating
test_data['UserRating']=UserRating_scaler.transform(np.array(test_data['UserRating']).reshape(len(test_data['UserRating']),1))
#SupportTicketsPerMonth
test_data['SupportTicketsPerMonth']=SupportTicketsPerMonth_scaler.transform(np.array(test_data['SupportTicketsPerMonth']).reshape(len(test_data['SupportTicketsPerMonth']),1))

#WatchlistSize
test_data['WatchlistSize']=WatchlistSize_scaler.transform(np.array(test_data['WatchlistSize']).reshape(len(test_data['WatchlistSize']),1))

In [100]:
#After Scaling
test_data[num_col].sample(3)

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,UserRating,SupportTicketsPerMonth,WatchlistSize
86753,115.0,12.316778,1416.429456,12.151427,84.212898,17.0,4.380782,1.0,10.0
52429,46.0,10.771938,495.509133,39.511954,83.285591,22.0,3.030276,4.9271390000000006e-17,4.0
50912,13.0,18.939876,246.218393,21.833913,7.82746,32.0,3.246179,5.0,12.0


# Feature selection

In [101]:
#Feature and Target
Feature=train_data.drop(columns="Churn")
Target=train_data['Churn']

In [102]:
#Train_Test_Split
from sklearn.model_selection import train_test_split as tts
f_train,f_test,t_train,t_test=tts(Feature,Target,test_size=0.3)

# Here we are using mlxtend libary because Mlxtend is a Python library that provides useful tools for day-to-day data science tasks. It is an extension and helper module for scikit-learn, which is a popular machine learning library in Python. Mlxtend provides a wide range of functionalities such as feature selection, data preprocessing, model evaluation, and ensemble methods. It also includes a variety of visualization tools to help you better understand your data and models.

Some of the key features of Mlxtend include:

Sequential feature selection
Stacking and ensemble methods
Visualizations for decision trees and confusion matrices
Data preprocessing tools such as normalization and scaling
Grid search and cross-validation tools.              mlxtend.feature_selection.SequentialFeatureSelector function in Python. This function is part of the mlxtend library, which is a collection of useful tools for data science tasks. The function allows you to perform sequential feature selection, which is a method of choosing a subset of features that are relevant for a machine learning model
The function takes a machine learning estimator (such as a classifier or a regressor) and a number of features to select as arguments. It then iteratively adds or removes features from the subset based on a scoring criterion until the desired number of features is reached
The use of SequentialFeatureSelector is to perform sequential feature selection, which is a method of choosing a subset of features that are relevant for a machine learning model. Sequential feature selection can help reduce the dimensionality of the data, improve the model performance, and avoid overfitting. Sequential feature selection can be done in two ways: forward or backward. Forward selection starts with an empty subset and adds features one by one based on a scoring criterion. Backward selection starts with the full set of features and removes features one by one based on a scoring criterion. The SequentialFeatureSelector function allows you to specify the number of features to select, the direction of the selection, the scoring criterion, and other parameters. You can use the SequentialFeatureSelector function with any machine learning estimator that has a fit and predict method, such as classifiers or regressors from the scikit-learn library.








In [103]:
#Importing required libraries for feature selection 
import mlxtend
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

#  it creates an instance of the SequentialFeatureSelector class from the mlxtend library. This class is used for feature selection. The k_features parameter specifies the number of features to select. The forward parameter specifies whether to use forward selection or backward elimination. The floating parameter specifies whether to use floating search. The verbose parameter specifies whether to print progress messages. The scoring parameter specifies the scoring metric to use. The cv parameter specifies the number of cross-validation folds.

In [104]:
#Using LogisticRegression backend sequential starts with the full set of features and remove features one by one basedon scoring criterion.
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()

sfs1 = SFS(lr, 
           k_features=19, 
           forward=True, 
           floating=False, 
           verbose=1,
           scoring='accuracy',
           cv=5)

sfs1 = sfs1.fit(Feature, Target)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:   19.2s finished
Features: 1/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:   21.0s finished
Features: 2/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:   22.2s finished
Features: 3/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:   21.0s finished
Features: 4/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   20.5s finished
Features: 5/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed:   21.2s finished
Features: 6/19[Parallel(

In [105]:
print('\nSequential Backward Floating Selection (k=3):')
print(sfs1.k_feature_idx_)
print("Feature Name")
print(sfs1.k_feature_names_)
print('CV Score:')
print(sfs1.k_score_)


Sequential Backward Floating Selection (k=3):
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
Feature Name
('AccountAge', 'MonthlyCharges', 'TotalCharges', 'SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess', 'DeviceRegistered', 'ViewingHoursPerWeek', 'AverageViewingDuration', 'ContentDownloadsPerMonth', 'GenrePreference', 'UserRating', 'SupportTicketsPerMonth', 'Gender', 'WatchlistSize', 'ParentalControl', 'SubtitlesEnabled')
CV Score:
0.8235590863269173


In [106]:
#Using DecisionTree
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier()

sfs2 = SFS(dtc, 
           k_features=19, 
           forward=True, 
           floating=False, 
           verbose=1,
           scoring='accuracy',
           cv=5)

sfs2 = sfs2.fit(Feature, Target)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:  2.5min finished
Features: 1/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  1.6min finished
Features: 2/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:  1.6min finished
Features: 3/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:  1.6min finished
Features: 4/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  1.7min finished
Features: 5/19[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  14 out of  14 | elapsed:  1.7min finished
Features: 6/19[Parallel(

In [107]:
print('\nSequential Backward Floating Selection (k=3):')
print(sfs2.k_feature_idx_)
print("Feature Name")
print(sfs1.k_feature_names_)
print('CV Score:')
print(sfs2.k_score_)


Sequential Backward Floating Selection (k=3):
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
Feature Name
('AccountAge', 'MonthlyCharges', 'TotalCharges', 'SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess', 'DeviceRegistered', 'ViewingHoursPerWeek', 'AverageViewingDuration', 'ContentDownloadsPerMonth', 'GenrePreference', 'UserRating', 'SupportTicketsPerMonth', 'Gender', 'WatchlistSize', 'ParentalControl', 'SubtitlesEnabled')
CV Score:
0.7280207660626026


# Model Building


In [108]:
# we are LogisticRegression model because this model giving good score we saw that before.
model=LogisticRegression()
#Model Training
model.fit(f_train[['AccountAge', 'MonthlyCharges', 'TotalCharges', 'SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess', 'DeviceRegistered', 'ViewingHoursPerWeek', 'AverageViewingDuration', 'ContentDownloadsPerMonth', 'GenrePreference', 'UserRating', 'SupportTicketsPerMonth', 'Gender', 'WatchlistSize', 'ParentalControl', 'SubtitlesEnabled']],t_train)
#Model Testing
t_pred=model.predict(f_test[['AccountAge', 'MonthlyCharges', 'TotalCharges', 'SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType', 'MultiDeviceAccess', 'DeviceRegistered', 'ViewingHoursPerWeek', 'AverageViewingDuration', 'ContentDownloadsPerMonth', 'GenrePreference', 'UserRating', 'SupportTicketsPerMonth', 'Gender', 'WatchlistSize', 'ParentalControl', 'SubtitlesEnabled']])

In [109]:
#Model Performance on Validation Data and checking accuracy score 
from sklearn.metrics import accuracy_score
print(f"Accuracy of model is {accuracy_score(t_test,t_pred)*100}")

Accuracy of model is 82.24975046829923
