Practical Application of Supervised and Unsupervised Learning

Task 1: Classification Algorithms

Logistic Regression Implementation:
- Implement a logistic regression model using a provided dataset.
- Evaluate the model's performance using appropriate metrics.


In [2]:
# import necessary libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# import the label encoder from scikit learn

from sklearn.preprocessing import LabelEncoder

# import the train test split

from sklearn.model_selection import train_test_split

# import the standardscaler from scikit learn

from sklearn.preprocessing import StandardScaler

# import the logistic regression classifier

from sklearn.linear_model import LogisticRegression

# import the SVM classifier

from sklearn.svm import SVC

# import the random forest classifier

from sklearn.ensemble import RandomForestClassifier

# import the classification report 

from sklearn.metrics import confusion_matrix, classification_report

# import performance metrics
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, mean_squared_error




In [34]:
# load the dataset

data = pd.read_excel('omdena2.xlsx')
data

,Age,Weight(kg),Height(cm),Systolic BP,BMI,Overweight
0,13,43.0,157,118.0,17.400000,No
1,14,46.0,159,111.0,18.200000,No
2,10,33.0,132,94.0,19.200000,YES
3,19,65.0,168,126.0,23.100000,No
4,12,31.0,151,87.0,13.800000,No
...,...,...,...,...,...,...
1224,15,45.5,156,112.0,18.696581,No
1225,14,57.0,163,135.0,21.300000,No
1226,16,54.0,158,106.0,21.631149,No
1227,15,47.0,154,109.0,19.900000,No


In [35]:
# display information about the variables

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1229 entries, 0 to 1228
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          1229 non-null   int64  
 1   Weight(kg)   1229 non-null   float64
 2   Height(cm)   1229 non-null   int64  
 3   Systolic BP  1224 non-null   float64
 4   BMI          1229 non-null   float64
 5   Overweight   1229 non-null   object 
dtypes: float64(3), int64(2), object(1)
memory usage: 57.7+ KB


In [36]:
# check if there are missing values

data.isna().sum()

Age            0
Weight(kg)     0
Height(cm)     0
Systolic BP    5
BMI            0
Overweight     0
dtype: int64

In [37]:
# drop rows with a missing Systolic BP value

data = data.dropna(subset=['Systolic BP'])

In [38]:
# check that no null values remain in the data

data.isnull().sum()

Age            0
Weight(kg)     0
Height(cm)     0
Systolic BP    0
BMI            0
Overweight     0
dtype: int64

In [39]:
# explore the dependent variable

data['Overweight'].unique()

array(['No', 'YES'], dtype=object)

In [40]:
# perform label encoding on the categorical dependent variable
# work on an explicit copy so the assignment below does not trigger a SettingWithCopyWarning

data = data.copy()

# initialize the label encoder

labeller = LabelEncoder()

# perform label encoding on the variable (Overweight): 'No' -> 0, 'YES' -> 1

data['Overweight'] = labeller.fit_transform(data['Overweight'])


In [41]:
# display first few rows to see the changes

data.head()

,Age,Weight(kg),Height(cm),Systolic BP,BMI,Overweight
0,13,43.0,157,118.0,17.4,0
1,14,46.0,159,111.0,18.2,0
2,10,33.0,132,94.0,19.2,1
3,19,65.0,168,126.0,23.1,0
4,12,31.0,151,87.0,13.8,0
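
The 0/1 codes in the preview come from how `LabelEncoder` orders the labels. The sketch below is a minimal illustration on a small hypothetical list of labels (not the notebook's data). It relies on the encoder storing the sorted unique labels in `classes_`, so `'No'` becomes 0 and `'YES'` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical sample of labels, using the same two values as the Overweight column
sample_labels = ['No', 'YES', 'No', 'No', 'YES']

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ holds the sorted unique labels; the position of each label is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'No': 0, 'YES': 1}
print(encoded.tolist())  # [0, 1, 0, 0, 1]

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(encoded).tolist()
```

Checking the mapping this way makes it clear that class 1 is the "overweight" class, which is the class the precision, recall, and F1 scores refer to by default.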


In [42]:
# separate the data into features and response variables

X = data.drop('Overweight', axis=1)
y = data['Overweight']

In [43]:
# train and test splitting of data

# split the data into train, and test in the ratio 0.8:0.2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 41 )

In [44]:
# next, standardize the features so that each has zero mean and unit variance

# initialize the standard scaler

sc = StandardScaler()

# fit the scaler on the training set only, then apply the same transformation to the test set

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
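
The scaler is fitted on the training set only so that no information from the test set leaks into training. The sketch below reproduces the idea in plain Python for a single feature. The two small lists are a hypothetical split used for illustration, and the helper names are not part of scikit-learn. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned from the training set."""
    return [(v - mean) / std for v in values]

# hypothetical split of an Age-like feature, for illustration only
train_ages = [13, 14, 10, 19, 12]
test_ages = [15, 16]

mean, std = fit_standardizer(train_ages)           # learned from the training set only
train_scaled = standardize(train_ages, mean, std)
test_scaled = standardize(test_ages, mean, std)    # reuses the training statistics

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, which is expected.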

Training Logistic regression model

In [45]:
# initialize the logistic regression classifier

lr = LogisticRegression()

# fit the model

lr.fit(X_train, y_train)

In [46]:
# display the training accuracy (for a classifier, score returns mean accuracy, not R^2)

print(lr.score(X_train, y_train))

0.898876404494382


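The training accuracy is about 0.90, so the model fits the training data well. This value is measured on the data the model was trained on; the test-set metrics are the better guide to how well it generalizes.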
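
For scikit-learn classifiers, `score` returns the mean accuracy on the data passed in, not an R² value. The sketch below checks this on synthetic data from `make_classification`, which is an assumption made for illustration and not the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression().fit(X_demo, y_demo)

# score() and accuracy_score() on the same data should agree
score_value = model.score(X_demo, y_demo)
accuracy_value = accuracy_score(y_demo, model.predict(X_demo))

print(score_value, accuracy_value)
```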

In [47]:
# do prediction using the test set

y_predicted = lr.predict(X_test)

In [48]:
# evaluate the model performance

# print the performance metrics


print("Accuracy:", "%.3f" % accuracy_score(y_test, y_predicted))
print("Precision:", "%.3f" % precision_score(y_test, y_predicted))
print("Recall:", "%.3f" % recall_score(y_test, y_predicted))
print("F1 Score:", "%.3f" % f1_score(y_test, y_predicted))

Accuracy: 0.914
Precision: 0.929
Recall: 0.684
F1 Score: 0.788


Support vector machine implementation

In [49]:
# initialize the classifier

svc = SVC()

# fit the data into the classifier

svc.fit(X_train, y_train)

# make predictions on the testing set

pred_svc = svc.predict(X_test)

In [50]:
# evaluate the performance

print("Accuracy:", "%.3f" % accuracy_score(y_test, pred_svc))
print("Precision:", "%.3f" % precision_score(y_test, pred_svc))
print("Recall:", "%.3f" % recall_score(y_test, pred_svc))
print("F1 Score:", "%.3f" % f1_score(y_test, pred_svc))

Accuracy: 0.935
Precision: 0.873
Recall: 0.842
F1 Score: 0.857


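From the observed performance metrics, logistic regression has higher precision (0.929 vs. 0.873), while the SVM classifier has higher accuracy (0.935 vs. 0.914), recall (0.842 vs. 0.684), and F1 score (0.857 vs. 0.788). The SVM therefore identifies more of the overweight cases, at the cost of a few more false positives.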
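
To make the comparison concrete, the sketch below computes the four metrics from confusion-matrix counts. The notebook does not print these counts. The values used here are an assumption: one set of counts on a 245-row test set that is consistent with the reported metrics. The function name `metrics_from_counts` is also hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# assumed counts, inferred from the reported metrics (not printed in the notebook)
logistic_counts = dict(tp=39, fp=3, fn=18, tn=185)
svm_counts = dict(tp=48, fp=7, fn=9, tn=181)

logistic_metrics = metrics_from_counts(**logistic_counts)
svm_metrics = metrics_from_counts(**svm_counts)

for name, metrics in [("logistic regression", logistic_metrics), ("SVM", svm_metrics)]:
    print(name, {key: "%.3f" % value for key, value in metrics.items()})
```

Under these assumed counts, logistic regression makes fewer false positives (3 vs. 7) but misses twice as many overweight cases (18 vs. 9), which is why its precision is higher and its recall lower.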

Random Forest Application:
- Implement a Random Forest classifier with a different dataset.
- Discuss scenarios where Random Forest might outperform other classifiers.


In [20]:
# import the dataset

data = pd.read_csv(r'Telco-Customer-Churn.csv')

In [21]:
# display the first few rows

data.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [22]:
# select variables of interest
# churn is the target variable
# take a copy so that the later column assignments modify an independent DataFrame

data = data[['SeniorCitizen', 'Partner', 'tenure', 'PhoneService', 'PaperlessBilling', 'MonthlyCharges', 'Churn']].copy()

In [23]:
# encode categorical data using the labelencoder
# initialize the encoder

lb = LabelEncoder()

# perform label encoding

data['Partner'] = lb.fit_transform(data['Partner'])
data['PhoneService'] = lb.fit_transform(data['PhoneService'])
data['PaperlessBilling'] = lb.fit_transform(data['PaperlessBilling'])
data['Churn'] = lb.fit_transform(data['Churn'])


In [24]:
# preview data

data

,SeniorCitizen,Partner,tenure,PhoneService,PaperlessBilling,MonthlyCharges,Churn
0,0,1,1,0,1,29.85,0
1,0,0,34,1,0,56.95,0
2,0,0,2,1,1,53.85,1
3,0,0,45,0,0,42.30,0
4,0,0,2,1,1,70.70,1
...,...,...,...,...,...,...,...
7038,0,1,24,1,1,84.80,0
7039,0,1,72,1,1,103.20,0
7040,0,1,11,0,1,29.60,0
7041,1,1,4,1,1,74.40,1


In [25]:
# separate data to features and target

X = data.drop('Churn', axis=1)
y = data['Churn']

In [26]:
# divide set to train and test

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=10)

Model training

In [27]:
# initialize the classifier

rfc = RandomForestClassifier()

In [28]:
# train the classifier on the training data

rfc.fit(X_train, y_train)

In [29]:
# predict using X_test

pred_rfc = rfc.predict(X_test)

In [30]:
# check model performance

# print the classification report

print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

           0       0.84      0.88      0.86      1066
           1       0.55      0.48      0.51       343

    accuracy                           0.78      1409
   macro avg       0.70      0.68      0.69      1409
weighted avg       0.77      0.78      0.77      1409



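The model has an overall accuracy of 0.78, but it is much weaker on the churn class (1): precision 0.55, recall 0.48, and F1 score 0.51. It finds fewer than half of the customers who actually churn.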
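
Because the classes are imbalanced, the accuracy is easier to interpret against a majority-class baseline. The sketch below uses only the support and recall values printed in the classification report. The churner estimate is approximate because the report rounds recall to two decimals.

```python
# class supports taken from the classification report
support = {0: 1066, 1: 343}
total = sum(support.values())

# accuracy of a trivial model that always predicts the majority class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.757

# approximate number of churners the model finds, from the rounded recall of 0.48
approx_churners_found = round(0.48 * support[1])
print(f"churners found: about {approx_churners_found} of {support[1]}")
```

The reported accuracy of 0.78 is only slightly above the 0.757 obtained by always predicting "no churn", so recall and F1 for class 1 are the more informative metrics here.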

Scenarios where Random Forest might outperform other classifiers:

1. When a single model tends to overfit: a random forest averages many decision trees, each trained on a bootstrap sample with a random subset of features, which reduces variance. It does not eliminate overfitting, but it usually generalizes better than a single deep decision tree.
2. When the data are tabular with nonlinear relationships or feature interactions: random forest classifiers tend to achieve high accuracy in these settings with little tuning and without feature scaling, which is useful where high accuracy is required, such as healthcare use cases.
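
One way to test the overfitting scenario is to compare a single decision tree with a random forest under cross-validation. The sketch below does this on synthetic, noisy data from `make_classification`. The dataset and its parameters are assumptions chosen for illustration, so the result says nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data with label noise (flip_y), where a single deep tree is prone to overfit
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print("single tree  :", round(tree_scores.mean(), 3))
print("random forest:", round(forest_scores.mean(), 3))
```

Running the same comparison on the actual training data, with the other candidate classifiers added, would show whether the random forest really is the better choice for a given problem.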

Ensemble Model Experimentation:
- Combine the models from tasks 1 and 2 using ensemble techniques (e.g., Voting Classifier).
- Evaluate the ensemble model's performance and explain any observed improvements.


In [31]:
from sklearn.ensemble import VotingClassifier

In [51]:
# initialize the voting classifier (hard voting by default) with the logistic regression and SVM models

vc = VotingClassifier([('lr',lr), ('svc',svc)])

In [52]:
# fit the voting classifier on the training data

vc.fit(X_train, y_train)

In [53]:
# use the ensemble for prediction

vc_predicted = vc.predict(X_test) 

In [55]:
# display the performance metrics

print("Accuracy:", "%.3f" % accuracy_score(y_test, vc_predicted))
print("Precision:", "%.3f" % precision_score(y_test, vc_predicted))
print("Recall:", "%.3f" % recall_score(y_test, vc_predicted))
print("F1 Score:", "%.3f" % f1_score(y_test, vc_predicted))

Accuracy: 0.914
Precision: 0.929
Recall: 0.684
F1 Score: 0.788


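The ensemble classifier's performance is the same as that of the logistic regression model, so the ensemble was not useful in this case: no improvement in predictions was observed. A likely explanation is that the ensemble uses hard voting over only two models, where a tie is resolved in favour of the lower class label (0). The ensemble then predicts class 1 only when both models do, which reproduces the logistic regression predictions if every case it flags is also flagged by the SVM.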
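
The effect of hard voting with only two models can be sketched in plain Python. The helper `hard_vote` is hypothetical and the two prediction lists are made up. It mimics the documented tie-breaking rule of `VotingClassifier`, where a tie goes to the class that comes first in ascending order.

```python
from collections import Counter

def hard_vote(votes):
    """Majority vote; a tie goes to the smallest class label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)

# hypothetical predictions from two models on five samples
lr_preds = [0, 1, 0, 1, 0]
svc_preds = [0, 1, 1, 1, 1]

# with two voters, any disagreement is a tie and resolves to class 0
ensemble = [hard_vote(pair) for pair in zip(lr_preds, svc_preds)]
print(ensemble)  # [0, 1, 0, 1, 0]
```

In this example the ensemble output equals the more conservative model's predictions. Adding a third model, or using `voting='soft'` (which requires `SVC(probability=True)`), avoids such ties and gives the ensemble a chance to differ from its members.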