# Classification

## Table of contents

### 1 Introduction to Classification

* What is classification?
* Types of classification problems
* Real-world examples of classification tasks

### 2 Getting Started with scikit-learn

* Installing scikit-learn
* Understanding the scikit-learn API
* Loading datasets and preprocessing

### 3 Supervised Learning: Classification Techniques

* a. Logistic Regression
* b. K-Nearest Neighbors (KNN)
* c. Support Vector Machines (SVM)
* d. Decision Trees
* e. Random Forest
* f. Naive Bayes

### 4 Model Evaluation and Selection

* Train-test split
* Cross-validation
* Performance metrics (accuracy, precision, recall, F1-score, ROC-AUC)
* Hyperparameter tuning (Grid search, Random search)
* Model selection and comparison

### 5 Advanced Classification Techniques
* a. Ensemble Methods
* b. Imbalanced Classification

### 6 Feature Selection and Dimensionality Reduction

* Variance threshold
* Recursive feature elimination (RFE)
* Principal component analysis (PCA)
* Linear discriminant analysis (LDA)

### 7 Practical Project

* Choosing a classification dataset
* Data preprocessing and exploration
* Model selection, training, and evaluation
* Hyperparameter tuning and model optimization
* Presenting the final results  

### 8 Assignment

## 1 Introduction to Classification

### What is classification?

Classification is a type of supervised machine learning task in which the goal is to assign objects or instances to predefined categories or classes. In supervised learning, the model learns from a dataset that contains input-output pairs, where the output (or target) is a discrete value representing the class label. Classification models can be used to predict the class of an object based on its input features.

!["classification"](image1.png)

### Types of classification problems:

There are two main types of classification problems:
    
    

* a. Binary Classification: In binary classification, there are only two possible classes. The model is trained to distinguish between these two classes. For example, classifying emails as spam or not spam.

* b. Multiclass Classification: In multiclass classification, there are more than two possible classes. The model is trained to classify instances into one of the multiple classes. For example, classifying handwritten digits into one of the ten classes (0 to 9).

In some cases, you might also encounter multilabel classification problems, where each instance can be assigned to multiple classes simultaneously. For example, classifying a text document into multiple topics.

### Real-world examples of classification tasks:


Here are some real-world examples of classification tasks:

* a. Email spam detection: Identifying whether an email is spam or not based on its content and other features.

* b. Medical diagnosis: Predicting the presence or absence of a disease based on patient data (e.g., symptoms, lab results).

* c. Sentiment analysis: Determining the sentiment (positive, negative, or neutral) of a given text or document.

* d. Handwritten digit recognition: Identifying the digit (0 to 9) represented by a handwritten image.

* e. Fraud detection: Detecting fraudulent transactions in a financial dataset based on transaction data and user behavior.

* f. Image classification: Categorizing images into predefined classes, such as animals, objects, or scenes.

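* g. Customer segmentation: Assigning customers to predefined segments based on their behavior or preferences for targeted marketing. (When the segments are not known in advance, this becomes a clustering problem rather than a classification one.)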

These are just a few examples; classification problems are widespread across various domains and industries.

## 2 Getting Started with scikit-learn

### Installing scikit-learn:

Scikit-learn is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. To install scikit-learn, you can use the following command with pip:

In [None]:
pip install scikit-learn

Or, if you are using Anaconda, use the following command with conda:

In [None]:
conda install scikit-learn

### Understanding the scikit-learn API:


Scikit-learn follows a consistent and easy-to-understand API structure. The main components of the API include:

* a. **Estimators**: The base object for all algorithms in scikit-learn. Estimators can be classifiers, regressors, or transformers. They expose a fit method for learning a model from the training data.

* b. **Transformers**: These are special types of estimators that can transform the input data. They expose a transform method for converting the input data, and a fit_transform method for fitting and transforming the data in a single step.

* c. **Predictors**: These are estimators that can make predictions given the input data. They expose a predict method for generating predictions and a score method for evaluating the quality of the predictions.
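
The three roles above can be seen together in a short sketch. The dataset (iris), the split parameters, and the variable names are illustrative assumptions, not part of the API itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Transformer: fit learns the per-feature mean and standard deviation,
# transform applies them to any data with the same columns
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Predictor: fit learns from labelled data, predict labels new data,
# and score returns the mean accuracy for classifiers
clf = KNeighborsClassifier()
clf.fit(X_train_std, y_train)
y_pred = clf.predict(X_test_std)
test_accuracy = clf.score(X_test_std, y_test)
print(y_pred[:5], test_accuracy)
```

Note that the scaler is fitted on the training data only and then reused on the test data, so no information from the test set leaks into preprocessing.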

### Loading datasets and preprocessing:

Scikit-learn provides various utilities for loading datasets and preprocessing the data. Some common tasks include:

* a. **Loading datasets**: Scikit-learn comes with several built-in datasets (e.g., iris, digits, breast cancer) that can be loaded using the datasets module.

In [4]:
from sklearn import datasets

iris = datasets.load_iris()
#iris.data

* b. **Data splitting**: Split the data into training and testing sets using the train_test_split function from the model_selection module.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

* c. **Feature scaling**: Standardize or normalize the data using transformers like StandardScaler or MinMaxScaler from the preprocessing module.

In [6]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

* d. **Handling missing values**: Impute missing values using transformers like SimpleImputer from the impute module.

In [7]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

By understanding the scikit-learn API and its utilities, you can load, preprocess, and prepare your data for various classification tasks.

## 3 Supervised Learning: Classification Techniques

### a. Logistic Regression

#### Understanding logistic regression: 

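Logistic regression is a linear model for classification. In the binary case, it estimates the probability that an instance belongs to the positive class by passing a linear combination of the features through the logistic (sigmoid) function, and training finds the coefficients whose decision boundary best separates the two classes. scikit-learn's LogisticRegression also handles multiclass targets, such as the three iris species used in the code below.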

#### Implementing logistic regression with scikit-learn:

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
predictions = logreg.predict(X_test_scaled)
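
To make the role of the sigmoid concrete, the following sketch recomputes the predicted probabilities by hand from the fitted coefficients. It assumes a binary dataset (breast cancer, chosen only for illustration) and its own variable names; it is a check of the idea, not a replacement for `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# A binary dataset, used here only as an example
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score z = w . x + b for every instance
z = X @ model.coef_.ravel() + model.intercept_[0]
# The sigmoid maps the score to the probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = model.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```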

### b. K-Nearest Neighbors (KNN)

#### Understanding KNN: 

K-Nearest Neighbors is a non-parametric, instance-based learning algorithm used for classification tasks. Given a new instance, KNN finds the k nearest training instances in the feature space and assigns the majority class label among these neighbors.

#### Implementing KNN with scikit-learn:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)
predictions = knn.predict(X_test_scaled)
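
The neighbor search and majority vote described above can be sketched in a few lines of NumPy. The function name `knn_predict` and the toy data are hypothetical and exist only for this illustration; in practice KNeighborsClassifier uses faster search structures.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new with the majority class among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two well-separated groups of points
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # → 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))  # → 1
```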

### c. Support Vector Machines (SVM)

#### Understanding SVM:
Support Vector Machines is a powerful classification algorithm that can be used for linear or non-linear classification tasks. The main idea of SVM is to find the hyperplane that best separates the classes with the maximum margin, which is the distance between the hyperplane and the nearest instances from each class (support vectors).

#### Implementing SVM with scikit-learn:

In [None]:
from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1)
svm.fit(X_train_scaled, y_train)
predictions = svm.predict(X_test_scaled)

### d. Decision Trees

#### Understanding decision trees: 
Decision trees are a type of flowchart-like structure used for classification tasks. The tree consists of nodes, which represent features or decisions, and branches, which represent the outcome of a decision. The model is trained to recursively split the data based on the feature that provides the best separation of the classes.

#### Implementing decision trees with scikit-learn:

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=3, random_state=62)
dtree.fit(X_train_scaled, y_train)
predictions = dtree.predict(X_test_scaled)
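
The flowchart-like structure can be inspected directly. The sketch below prints the learned splits as text using `export_text`; the shallow depth and the unscaled iris data are assumptions made to keep the output short and readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a decision on one feature; leaves show the predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```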

### e. Random Forest

#### Understanding random forest: 

Random Forest is an ensemble learning method that constructs multiple decision trees and combines their predictions through majority voting. It improves the performance and generalization of a single decision tree by reducing overfitting and adding randomness to the model.

#### Implementing random forest with scikit-learn:

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
rf.fit(X_train_scaled, y_train)
predictions = rf.predict(X_test_scaled)
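
How the trees are combined can be checked by hand. In scikit-learn's implementation, the forest averages the class probabilities predicted by the individual trees rather than counting hard votes, which is a soft form of majority voting. The sketch below, with illustrative parameters and variable names, reproduces the forest's probabilities from its trees.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
forest.fit(X, y)

# Stack the per-tree class probabilities: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([t.predict_proba(X) for t in forest.estimators_])
avg_proba = tree_probas.mean(axis=0)

print(np.allclose(avg_proba, forest.predict_proba(X)))
```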

### f. Naive Bayes

#### Understanding naive bayes: 

Naive Bayes is a family of probabilistic classification algorithms based on Bayes' theorem, with an assumption of conditional independence between features given the class. Common types include Gaussian Naive Bayes (for continuous features, used below), Multinomial Naive Bayes, and Bernoulli Naive Bayes. The multinomial and Bernoulli variants are particularly useful for large-scale text classification tasks.

#### Implementing naive bayes with scikit-learn:




In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)
predictions = gnb.predict(X_test_scaled)

## 4 Model Evaluation and Selection

### Train-test split:
The train-test split is a technique used to divide the dataset into two parts, one for training the model and the other for testing the model's performance. This helps to evaluate the model's ability to generalize to unseen data.

In [None]:
from sklearn.model_selection import train_test_split
# X is the feature matrix and y the label vector, e.g. X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Cross-validation:
Cross-validation is a more robust technique for evaluating the model's performance by dividing the dataset into multiple folds. The model is trained and tested multiple times, using different combinations of training and testing folds. The most common method is k-fold cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(gnb, X_train_scaled, y_train, cv=5)  # 5-fold cross-validation
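
What `cross_val_score` does for a classifier with `cv=5` can be written out as an explicit loop. The sketch below assumes the iris data and a Gaussian Naive Bayes model; for classifiers, an integer `cv` uses stratified folds, so the loop uses StratifiedKFold to match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB()

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on four folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))  # test on the fifth

scores = cross_val_score(model, X, y, cv=5)
print(np.round(manual_scores, 3), np.round(scores, 3))
```

Reporting the mean and standard deviation of the fold scores gives a more stable estimate than a single train-test split.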

### Performance metrics:

Various performance metrics can be used to evaluate the quality of a classification model. Some common metrics include:

* **Accuracy**: The proportion of correct predictions among the total number of instances.
* **Precision**: The proportion of true positives among the instances predicted as positive.
* **Recall (Sensitivity)**: The proportion of true positives among the instances that are actually positive.
* **F1-score:** The harmonic mean of precision and recall.
* **ROC-AUC:** The area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.

In [None]:
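from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
accuracy = accuracy_score(y_test, predictions)
# The iris target has three classes, so an averaging strategy is required
precision = precision_score(y_test, predictions, average='macro')
recall = recall_score(y_test, predictions, average='macro')
f1 = f1_score(y_test, predictions, average='macro')
# ROC-AUC needs class probabilities; use one-vs-rest for multiclass targets
roc_auc = roc_auc_score(y_test, gnb.predict_proba(X_test_scaled), multi_class='ovr')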
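
The definitions above can be verified on a small binary example by counting the four outcome types directly. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # → 0.75 0.75 0.75 0.75
```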

### Hyperparameter tuning:

Hyperparameter tuning is the process of finding the optimal values for the hyperparameters of a model. Two popular methods for hyperparameter tuning are grid search and random search.

#### Grid search: 
Exhaustively tries all possible combinations of hyperparameter values.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)
grid_search.best_params_  
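
To see why grid search becomes expensive, it helps to count the candidates. The sketch below enumerates the same grid with `ParameterGrid`; the fold count of 5 mirrors the `cv=5` setting above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
candidates = list(ParameterGrid(param_grid))

n_folds = 5
n_cv_fits = len(candidates) * n_folds  # one fit per candidate per fold
print(len(candidates), n_cv_fits)      # → 6 30
```

Every additional hyperparameter multiplies the number of candidates, which is the main motivation for random search.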

#### Random search: 

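Samples a fixed number of random combinations (`n_iter`) from specified lists or distributions of hyperparameter values.

In [None]:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
# C is drawn from a continuous log-uniform distribution instead of a fixed list
param_dist = {'C': loguniform(0.1, 10), 'kernel': ['linear', 'rbf']}
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train_scaled, y_train)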

### Model selection and comparison:

After evaluating the performance of different models and tuning their hyperparameters, you can compare the models and select the one that performs best on the chosen metrics. This will help you choose the most suitable model for your classification task.

In [None]:
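# Example: Comparing the accuracy of two models
# Recompute each model's predictions, since `predictions` was overwritten above
predictions_logreg = logreg.predict(X_test_scaled)
predictions_knn = knn.predict(X_test_scaled)
accuracy_logreg = accuracy_score(y_test, predictions_logreg)
accuracy_knn = accuracy_score(y_test, predictions_knn)
print("Logistic Regression Accuracy:", accuracy_logreg)
print("K-Nearest Neighbors Accuracy:", accuracy_knn)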

## 5 Advanced Classification Techniques

### a. Ensemble Methods

Ensemble methods are techniques that combine the predictions of multiple base models to improve the overall performance and generalization of the final model. There are three main types of ensemble methods:

#### Bagging: 
Bagging (Bootstrap Aggregating) is an ensemble method that trains multiple base models, typically decision trees, using different subsets of the training data, sampled with replacement. The final prediction is obtained by averaging (for regression) or majority voting (for classification).

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
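# Pass the base model positionally: its keyword is `estimator` in scikit-learn 1.2+
# and was `base_estimator` in earlier versions
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=42)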
bagging.fit(X_train_scaled, y_train)
predictions = bagging.predict(X_test_scaled)

#### Boosting: 

Boosting algorithms train a sequence of weak models, typically decision trees, with each model learning from the errors of its predecessor. Popular boosting algorithms include AdaBoost and Gradient Boosting.

##### AdaBoost:

In [None]:
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=42)
adaboost.fit(X_train_scaled, y_train)
predictions = adaboost.predict(X_test_scaled)

##### Gradient Boosting:


In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gboost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gboost.fit(X_train_scaled, y_train)
predictions = gboost.predict(X_test_scaled)

#### Stacking: 
Stacking combines the predictions of multiple base models using a meta-model, which is trained on the output of the base models. This allows the meta-model to learn how to best combine the predictions of the base models.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
base_estimators = [('logreg', LogisticRegression()), ('knn', KNeighborsClassifier()), ('svm', SVC())]
stacking = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression())
stacking.fit(X_train_scaled, y_train)
predictions = stacking.predict(X_test_scaled)

### b. Imbalanced Classification
Imbalanced classification deals with datasets where one class is significantly under-represented compared to the other classes. This can lead to biased models that perform poorly on the minority class.

#### Understanding imbalanced datasets: 

Imbalanced datasets can occur in real-world problems like fraud detection, medical diagnosis, and rare event prediction. The imbalance can lead to a higher misclassification rate for the minority class, as the model is biased towards the majority class.

#### Resampling techniques: 

Resampling techniques can be used to balance the class distribution by either oversampling the minority class or undersampling the majority class.

* **Oversampling**: Randomly replicating instances from the minority class to increase its representation.

In [8]:
pip install imbalanced-learn




In [None]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy='minority')
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

* **Undersampling**: Randomly removing instances from the majority class to decrease its representation.


In [None]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='majority')
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

#### Evaluation metrics for imbalanced datasets: 

In imbalanced datasets, accuracy alone is often misleading: a model that always predicts the majority class can reach high accuracy while never detecting the minority class. Metrics like precision, recall, F1-score, and the area under the precision-recall curve (PR-AUC) give a more informative picture.

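Precision-Recall Curve and PR-AUC (for a binary target, computed from predicted scores rather than hard class labels):

In [None]:
from sklearn.metrics import precision_recall_curve, auc
# y_scores: predicted probability of the positive class for a binary problem,
# for example y_scores = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)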
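
The following sketch illustrates both points on synthetic data: a baseline that always predicts the majority class looks good on accuracy but has zero recall, and PR-AUC is computed from a real model's predicted probabilities. The dataset from `make_classification`, the 95/5 class ratio, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred)
print(baseline_accuracy, baseline_recall)

# PR-AUC for a real model, using scores for the positive class
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)
print(pr_auc)
```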

By using advanced classification techniques like ensemble methods and addressing imbalanced datasets with resampling techniques, you can improve the performance and generalization of your classification models. Additionally, using appropriate evaluation metrics will help you better assess and compare models on imbalanced datasets.

## 6 Feature Selection and Dimensionality Reduction

Feature selection and dimensionality reduction techniques help to select the most important features and reduce the number of features used in a model. This can lead to:

* simpler models
* lower computational cost
* improved performance

### Variance threshold:
Variance threshold is a simple feature selection technique that removes features with variance below a certain threshold. This is based on the assumption that features with low variance don't contribute much to the model's performance.

In [None]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
X_high_variance = selector.fit_transform(X)
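
A small made-up matrix shows which columns the selector drops. Here the first and last columns are constant, so their variance is zero and they fall below the threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_demo = np.array([[0, 2.0, 0.1, 3],
                   [0, 1.0, 4.0, 3],
                   [0, 1.5, 1.0, 3],
                   [0, 0.5, 2.5, 3]])

selector = VarianceThreshold(threshold=0.1)
X_kept = selector.fit_transform(X_demo)

print(selector.variances_)     # per-column variance
print(selector.get_support())  # boolean mask of the columns that were kept
print(X_kept.shape)            # → (4, 2)
```

Because variance depends on the scale of each feature, the threshold is only comparable across features that share similar units or have been normalized to a common range.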

### Recursive feature elimination (RFE):

RFE is a feature selection technique that iteratively removes the least important features, using importance scores taken from the coefficients or feature importances of a fitted model. RFE ranks the features and selects the best subset of the requested size.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5, step=1)
X_rfe = selector.fit_transform(X, y)

### Principal component analysis (PCA):
PCA is a dimensionality reduction technique that projects the original data into a lower-dimensional space while preserving the maximum variance. It can help to reduce noise and improve the performance of models, especially when dealing with high-dimensional data.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # Reduce to 2 principal components
X_pca = pca.fit_transform(X)
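
How much variance the retained components preserve can be read from `explained_variance_ratio_`. The sketch below assumes the iris data and standardizes it first, since PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

ratios = pca.explained_variance_ratio_
print(X_2d.shape)            # → (150, 2)
print(ratios, ratios.sum())  # share of total variance kept by each component
```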

### Linear discriminant analysis (LDA):

LDA is a supervised dimensionality reduction technique that seeks to find a linear combination of features that maximizes the separation between classes. LDA can improve the performance of classification models by reducing the number of features while preserving class separability.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=1)  # Reduce to 1 linear discriminant
X_lda = lda.fit_transform(X, y)
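
Unlike PCA, the number of LDA components is bounded by the number of classes: at most `min(n_features, n_classes - 1)` discriminants exist. The sketch below computes that bound for the iris data, an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# 4 features and 3 classes allow at most 2 linear discriminants
max_components = min(X.shape[1], len(set(y)) - 1)

lda = LinearDiscriminantAnalysis(n_components=max_components)
X_lda = lda.fit_transform(X, y)
print(max_components, X_lda.shape)  # → 2 (150, 2)
```

For a binary target this bound is 1, which is why `n_components=1` is the only valid choice in that case.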

By applying feature selection and dimensionality reduction techniques, you can simplify your classification models and potentially improve their performance. These techniques can also help to reduce the computational cost and training time of your models.

## 7 Practical Project

In this practical project, we will go through the process of choosing a real-world classification dataset, preprocessing and exploring the data, selecting, training, and evaluating models, tuning hyperparameters, and presenting the final results.



### 1 Choosing a classification dataset:
Find a suitable classification dataset for your project. Examples include the Iris dataset, the Breast Cancer Wisconsin dataset, or the Wine Quality dataset. You can also explore public datasets available on platforms like Kaggle or UCI Machine Learning Repository.

### 2 Data preprocessing and exploration:
Load the dataset, clean the data if necessary, and perform exploratory data analysis to understand the data's characteristics.

In this example, we use the red wine subset of the Wine Quality dataset from the UCI Machine Learning Repository. The code in the next step loads the data, and the `data.head()` output after it gives a first look at the columns.

### 3 Model selection, training, and evaluation:

Split the data into training and testing sets, train different classification models, and evaluate their performance using appropriate metrics.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Preprocessing
X = data.drop('quality', axis=1)
y = data['quality']

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training and evaluating models
#logisticregression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
logreg_preds = logreg.predict(X_test)

#random forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

#svc
svc = SVC()
svc.fit(X_train, y_train)
svc_preds = svc.predict(X_test)

# Metrics
def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

logreg_metrics = evaluate(y_test, logreg_preds)
rf_metrics = evaluate(y_test, rf_preds)
svc_metrics = evaluate(y_test, svc_preds)

print("Logistic Regression Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*logreg_metrics))
print("Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_metrics))
print("SVM Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*svc_metrics))



Logistic Regression Metrics - Accuracy: 0.5645833333333333, Precision: 0.5253065757465923, Recall: 0.5645833333333333, F1-score: 0.54012725727842
Random Forest Metrics - Accuracy: 0.66875, Precision: 0.6389462477359917, Recall: 0.66875, F1-score: 0.6494791757494375
SVM Metrics - Accuracy: 0.60625, Precision: 0.5723558062937009, Recall: 0.60625, F1-score: 0.5792536090757506


This code loads the Wine Quality dataset, preprocesses it, splits it into training and testing sets, trains three classification models, and evaluates their performance using accuracy, precision, recall, and F1-score.

In [15]:
data.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


### 4 Hyperparameter tuning and model optimization:
    
Optimize the models' performance by tuning their hyperparameters using techniques like grid search or random search.

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Preprocessing
X = data.drop('quality', axis=1)
y = data['quality']

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training and evaluating the baseline model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

# Metrics
def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

rf_metrics = evaluate(y_test, rf_preds)
print("Baseline Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_metrics))

# Hyperparameter tuning using Grid Search
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

# Training and evaluating the optimized model
rf_optimized = RandomForestClassifier(**best_params)
rf_optimized.fit(X_train, y_train)
rf_optimized_preds = rf_optimized.predict(X_test)

rf_optimized_metrics = evaluate(y_test, rf_optimized_preds)
print("Optimized Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_optimized_metrics))


Baseline Random Forest Metrics - Accuracy: 0.6604166666666667, Precision: 0.6325637254901962, Recall: 0.6604166666666667, F1-score: 0.6437747929338047
Optimized Random Forest Metrics - Accuracy: 0.6708333333333333, Precision: 0.6438266594516595, Recall: 0.6708333333333333, F1-score: 0.654023433244079


This code loads the Wine Quality dataset, preprocesses it, splits it into training and testing sets, trains a baseline Random Forest model, and evaluates its performance. Then, it performs hyperparameter tuning using Grid Search and retrains the optimized model, evaluating its performance to compare with the baseline.

### 5 Presenting the final results:
After optimizing the models, select the best model based on the chosen evaluation metrics, and present the final results, including the model's performance on the test dataset, feature importances or coefficients, and any insights derived from the analysis.

In [3]:
%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(url, sep=";")

# Preprocessing
X = data.drop('quality', axis=1)
y = data['quality']

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hyperparameter tuning using Grid Search
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

# Training and evaluating the optimized model
rf_optimized = RandomForestClassifier(**best_params)
rf_optimized.fit(X_train, y_train)
rf_optimized_preds = rf_optimized.predict(X_test)

# Metrics
def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

rf_optimized_metrics = evaluate(y_test, rf_optimized_preds)
print("Optimized Random Forest Metrics - Accuracy: {}, Precision: {}, Recall: {}, F1-score: {}".format(*rf_optimized_metrics))

# Identifying important features
important_features = pd.Series(rf_optimized.feature_importances_, index=X.columns)
important_features = important_features.sort_values(ascending=False)

print("\nImportant Features:")
print(important_features)

# Conclusion
print("\nBased on the evaluation metrics, the Optimized Random Forest model is the best-performing model.")
print("The top features contributing to wine quality prediction are:")
print(important_features.head(5))


Optimized Random Forest Metrics - Accuracy: 0.6625, Precision: 0.6283094965675058, Recall: 0.6625, F1-score: 0.6424449181091546

Important Features:
alcohol                 0.166532
sulphates               0.115290
volatile acidity        0.105170
total sulfur dioxide    0.102240
density                 0.095069
chlorides               0.076533
citric acid             0.070862
pH                      0.069892
fixed acidity           0.068946
residual sugar          0.064850
free sulfur dioxide     0.064615
dtype: float64

Based on the evaluation metrics, the Optimized Random Forest model is the best-performing model.
The top features contributing to wine quality prediction are:
alcohol                 0.166532
sulphates               0.115290
volatile acidity        0.105170
total sulfur dioxide    0.102240
density                 0.095069
dtype: float64
CPU times: total: 6min 6s
Wall time: 7min 3s


This code loads the Wine Quality dataset, preprocesses it, splits it into training and testing sets, performs hyperparameter tuning using Grid Search, and trains the optimized Random Forest model. It then evaluates the model's performance on the test dataset, identifies the important features, and presents the final results.

In the conclusion, we report the best model based on the chosen evaluation metrics and list the top features contributing to wine quality prediction.

In [None]:
# Additional notes

In [None]:
pip install imbalanced-learn

In [None]:
#oversampling
import imblearn
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy='minority')
X_oresampled, y_oresampled = ros.fit_resample(X, y)

In [None]:
y_oresampled.value_counts()

In [None]:
#Undersampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='majority')
X_uresampled, y_uresampled = rus.fit_resample(X, y)

In [None]:
y_uresampled.value_counts()

## 8 Assignment

## Assignment: Predicting Customer Churn

### Objective: 

The goal of this assignment is to build a classification model to predict whether a customer will churn (stop using a service) based on their features and interactions with the service.

### Dataset: 
The Telco Customer Churn dataset, available on Kaggle, contains information about a fictional telecommunication company's customers and whether they have churned. You can download the dataset from https://www.kaggle.com/datasets/blastchar/telco-customer-churn

### Tasks:

* Load and explore the dataset: Analyze the distribution of features, check for missing values, and visualize relationships between features and the target variable (churn).

    
    
* Preprocess the data: Handle missing values, convert categorical variables to numeric, and normalize/standardize the features if necessary.

    
    
* Split the dataset: Divide the dataset into training and testing sets.

    
    
* Train classification models: Train various classification models (e.g., logistic regression, KNN, SVM, decision tree, random forest, etc.) on the training dataset.

    
    
* Evaluate the models: Assess the performance of the models using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

    
    
* Optimize the models: Perform hyperparameter tuning using techniques like grid search or random search to improve the performance of the models.

    
    
* Feature selection and dimensionality reduction: Apply feature selection techniques such as RFE, variance threshold, or dimensionality reduction methods like PCA and LDA to reduce the number of features and potentially improve model performance.

    
    
* Select the best model: Choose the best-performing model based on the evaluation metrics.

    
    
* Interpret the results: Discuss the performance of the chosen model, the importance of different features, and any insights gained from the analysis.

    
    
* Conclusion: Summarize the findings, mention any limitations of the project, and suggest possible improvements or future work.
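
Of the metrics listed in the evaluation task, ROC-AUC is the one that needs predicted scores or probabilities rather than hard class labels. A minimal sketch on synthetic data (`make_classification` stands in for the churn data here, and the variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of the positive class
proba_pos = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba_pos)
print(f"ROC-AUC: {auc:.3f}")
```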

In [25]:
#libraries
import pandas as pd
import numpy as np

In [26]:
# Load dataset
data = pd.read_csv('data/Telco-Customer-Churn.csv')

In [27]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   int32  
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   int32  
 4   Dependents        7043 non-null   int32  
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   int32  
 7   MultipleLines     7043 non-null   int32  
 8   InternetService   7043 non-null   int32  
 9   OnlineSecurity    7043 non-null   int32  
 10  OnlineBackup      7043 non-null   int32  
 11  DeviceProtection  7043 non-null   int32  
 12  TechSupport       7043 non-null   int32  
 13  StreamingTV       7043 non-null   int32  
 14  StreamingMovies   7043 non-null   int32  
 15  Contract          7043 non-null   int32  
 16  PaperlessBilling  7043 non-null   int32  


In [5]:
data.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [9]:
data.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
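
A zero count from `isnull()` does not rule out missing data: values stored as blank strings are not NaN. The solution at the end of this section converts `TotalCharges` with `pd.to_numeric(..., errors='coerce')` for that reason. A minimal sketch on made-up values (the Series below is illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative values only: one entry is a blank string, as can happen in CSV exports
charges = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

print(charges.isnull().sum())  # blank strings are not counted as missing

# errors='coerce' turns unparseable entries into NaN
numeric = pd.to_numeric(charges, errors="coerce")
print(numeric.isnull().sum())

# Fill the gap with the column mean, as the solution does
filled = numeric.fillna(numeric.mean())
```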

In [30]:
data.Churn.value_counts()

Churn
No     5174
Yes    1869
Name: count, dtype: int64
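
These counts show a clear class imbalance, which matters when reading accuracy figures. As a quick worked check using the counts above, a model that always predicts `No` would already be right for roughly 73% of customers:

```python
# Class counts taken from the value_counts() output above
n_no, n_yes = 5174, 1869
total = n_no + n_yes

churn_rate = n_yes / total
baseline_accuracy = n_no / total  # accuracy of always predicting "No"

print(f"Churn rate: {churn_rate:.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model accuracy is best judged against this baseline rather than against 50%.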

In [10]:
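#Label encoding
# Note: TotalCharges is read as an object column (it is absent from describe() above), so this loop
# also replaces its amounts with arbitrary category codes. The solution at the end of this section
# converts it with pd.to_numeric before encoding.
from sklearn.preprocessing import LabelEncoder

for column in data.select_dtypes(include=['object']):
    if column != 'customerID':
        data[column] = LabelEncoder().fit_transform(data[column])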


In [11]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,1,2,29.85,2505,0
1,5575-GNVDE,1,0,0,0,34,1,0,0,2,...,2,0,0,0,1,0,3,56.95,1466,0
2,3668-QPYBK,1,0,0,0,2,1,0,0,2,...,0,0,0,0,0,1,3,53.85,157,1
3,7795-CFOCW,1,0,0,0,45,0,1,0,2,...,2,2,0,0,1,0,0,42.3,1400,0
4,9237-HQITU,0,0,0,0,2,1,0,1,0,...,0,0,0,0,0,1,2,70.7,925,1
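
`LabelEncoder` maps each category to an integer, which imposes an arbitrary order on nominal columns such as `PaymentMethod`. Tree-based models tolerate this reasonably well, but linear models may read the codes as magnitudes. One common alternative is one-hot encoding. A minimal sketch with `pd.get_dummies` on a small made-up frame (the rows are illustrative, though the column names follow the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Mailed check"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Contract", "PaymentMethod"], dtype=int)
print(encoded.columns.tolist())
```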


In [13]:
#Train_test_split
from sklearn.model_selection import train_test_split

X = data.drop(['customerID', 'Churn'], axis=1)
y = data['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [14]:
# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
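
Because only about a quarter of the customers churn, it can help to keep the class ratio the same in both splits. `train_test_split` supports this through its `stratify` argument. A small sketch on synthetic data (the arrays are generated for illustration and stand in for `X` and `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)  # 25% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it for the test split
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(y_tr.mean(), y_te.mean())  # both close to 0.25
```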

### Models

In [16]:
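# Logistic Regression
# RandomForest Classifier
# Note: a bare `%time` line times an empty statement, which is why 0 ns is reported below.
# To time the whole cell, put `%%time` on the first line of the cell instead.
%time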

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_preds = logreg.predict(X_test)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

CPU times: total: 0 ns
Wall time: 0 ns


### Evaluation metrics

In [19]:
#Evaluating the performance of the two models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Logistic regression
logreg_accuracy = accuracy_score(y_test, logreg_preds)
logreg_precision = precision_score(y_test, logreg_preds)
logreg_recall = recall_score(y_test, logreg_preds)
logreg_f1 = f1_score(y_test, logreg_preds)

#random forest classifier
rf_accuracy = accuracy_score(y_test, rf_preds)
rf_precision = precision_score(y_test, rf_preds)
rf_recall = recall_score(y_test, rf_preds)
rf_f1 = f1_score(y_test, rf_preds)

#output

print(f"Logistic Regression - Accuracy: {logreg_accuracy}, Precision: {logreg_precision}, Recall: {logreg_recall}, F1: {logreg_f1}")
print(f"Random Forest - Accuracy: {rf_accuracy}, Precision: {rf_precision}, Recall: {rf_recall}, F1: {rf_f1}")

Logistic Regression - Accuracy: 0.8078561287269286, Precision: 0.6842105263157895, Recall: 0.5435540069686411, F1: 0.6058252427184466
Random Forest - Accuracy: 0.795551348793185, Precision: 0.6775, Recall: 0.4721254355400697, F1: 0.5564681724845997
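
As a sanity check on how these numbers relate, F1 is the harmonic mean of precision and recall. Plugging in the logistic regression values printed above reproduces the reported F1:

```python
# Precision and recall as reported above for logistic regression
precision = 0.6842105263157895
recall = 0.5435540069686411

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.6058
```

Because the harmonic mean is pulled toward the smaller value, the modest recall keeps F1 well below the accuracy figure.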


### GridSearch

In [31]:
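%time
# Note: a bare `%time` times nothing (hence the 0 ns below); `%%time` on the first line would time the cell.

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)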
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)

CPU times: total: 0 ns
Wall time: 0 ns
{'max_depth': 10, 'n_estimators': 50}
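
`GridSearchCV` scores classifiers by accuracy unless told otherwise, and it tries every combination in the grid. The assignment also mentions random search. A hedged sketch with `RandomizedSearchCV` and `scoring='f1'` follows. The data is synthetic (`make_classification`), and the parameter ranges are illustrative choices rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # sample 5 of the 36 possible combinations
    scoring="f1",      # optimize F1 instead of the default accuracy
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```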


In [22]:
# Feature selection
from sklearn.feature_selection import RFE

rfe = RFE(RandomForestClassifier(**best_params), n_features_to_select=10)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)

rf_optimized = RandomForestClassifier(**best_params)
rf_optimized.fit(X_train_rfe, y_train)
rf_optimized_preds = rf_optimized.predict(X_test_rfe)

rf_optimized_accuracy = accuracy_score(y_test, rf_optimized_preds)
rf_optimized_precision = precision_score(y_test, rf_optimized_preds)
rf_optimized_recall = recall_score(y_test, rf_optimized_preds)
rf_optimized_f1 = f1_score(y_test, rf_optimized_preds)

print(f"Optimized Random Forest - Accuracy: {rf_optimized_accuracy}, Precision: {rf_optimized_precision}, Recall: {rf_optimized_recall}, F1: {rf_optimized_f1}")


Optimized Random Forest - Accuracy: 0.7974443918599148, Precision: 0.665158371040724, Recall: 0.5121951219512195, F1: 0.578740157480315
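
Scaling and feature selection can also be wrapped in a `Pipeline`, so that both are refit inside each cross-validation fold and the held-out fold never influences them. A minimal sketch on synthetic data; the estimator choice and `n_features_to_select=5` are assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold refits the scaler, the selector and the classifier on its own training part
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1")
print(scores.mean())

# After fitting on all data, the selector exposes which features were kept
pipe.fit(X_demo, y_demo)
n_kept = int(pipe.named_steps["select"].support_.sum())
print(n_kept)
```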


In [23]:

#Identifying important features
important_features = pd.Series(rf_optimized.feature_importances_, index=X.columns[rfe.support_])
important_features = important_features.sort_values(ascending=False)

print("\nImportant Features:")
print(important_features)


Important Features:
tenure              0.221821
MonthlyCharges      0.189077
TotalCharges        0.154576
Contract            0.143158
OnlineSecurity      0.088579
TechSupport         0.058666
PaymentMethod       0.051025
InternetService     0.040232
OnlineBackup        0.027201
PaperlessBilling    0.025664
dtype: float64


In [24]:

#Conclusion
# Logistic Regression scores highest on accuracy, precision, recall and F1 in the outputs above
print("\nBased on the evaluation metrics, Logistic Regression is the best-performing model on this test split.")
print("The top features of the Optimized Random Forest for customer churn prediction are:")
print(important_features.head(5))


Based on the evaluation metrics, Logistic Regression is the best-performing model on this test split.
The top features of the Optimized Random Forest for customer churn prediction are:
tenure            0.221821
MonthlyCharges    0.189077
TotalCharges      0.154576
Contract          0.143158
OnlineSecurity    0.088579
dtype: float64
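
Rather than hard-coding the name of the winning model in a print statement, the comparison can be derived from the recorded metrics. A small sketch using the F1 scores printed earlier in this section (F1 is chosen here as one reasonable criterion for an imbalanced target; another metric could be substituted):

```python
# F1 scores copied from the outputs printed earlier in this section
f1_scores = {
    "Logistic Regression": 0.6058252427184466,
    "Random Forest": 0.5564681724845997,
    "Optimized Random Forest": 0.578740157480315,
}

# Pick the model whose recorded F1 is highest
best_model = max(f1_scores, key=f1_scores.get)
print(f"Best model by F1: {best_model} ({f1_scores[best_model]:.4f})")
```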


# Solution for the Customer Churn assignment

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Load dataset
data = pd.read_csv('data/Telco-Customer-Churn.csv')

# Preprocessing
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
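# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())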

for column in data.select_dtypes(include=['object']):
    if column != 'customerID':
        data[column] = LabelEncoder().fit_transform(data[column])

# Splitting dataset
X = data.drop(['customerID', 'Churn'], axis=1)
y = data['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training and evaluating models
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_preds = logreg.predict(X_test)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

# Metrics
logreg_accuracy = accuracy_score(y_test, logreg_preds)
logreg_precision = precision_score(y_test, logreg_preds)
logreg_recall = recall_score(y_test, logreg_preds)
logreg_f1 = f1_score(y_test, logreg_preds)

rf_accuracy = accuracy_score(y_test, rf_preds)
rf_precision = precision_score(y_test, rf_preds)
rf_recall = recall_score(y_test, rf_preds)
rf_f1 = f1_score(y_test, rf_preds)

print(f"Logistic Regression - Accuracy: {logreg_accuracy}, Precision: {logreg_precision}, Recall: {logreg_recall}, F1: {logreg_f1}")
print(f"Random Forest - Accuracy: {rf_accuracy}, Precision: {rf_precision}, Recall: {rf_recall}, F1: {rf_f1}")

# Hyperparameter tuning
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

# Feature selection
rfe = RFE(RandomForestClassifier(**best_params), n_features_to_select=10)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)

rf_optimized = RandomForestClassifier(**best_params)
rf_optimized.fit(X_train_rfe, y_train)
rf_optimized_preds = rf_optimized.predict(X_test_rfe)

rf_optimized_accuracy = accuracy_score(y_test, rf_optimized_preds)
rf_optimized_precision = precision_score(y_test, rf_optimized_preds)
rf_optimized_recall = recall_score(y_test, rf_optimized_preds)
rf_optimized_f1 = f1_score(y_test, rf_optimized_preds)

print(f"Optimized Random Forest - Accuracy: {rf_optimized_accuracy}, Precision: {rf_optimized_precision}, Recall: {rf_optimized_recall}, F1: {rf_optimized_f1}")


#Identifying important features
important_features = pd.Series(rf_optimized.feature_importances_, index=X.columns[rfe.support_])
important_features = important_features.sort_values(ascending=False)

print("\nImportant Features:")
print(important_features)


#Conclusion
# Logistic Regression scores highest on accuracy, precision, recall and F1 in the outputs below
print("\nBased on the evaluation metrics, Logistic Regression is the best-performing model on this test split.")
print("The top features of the Optimized Random Forest for customer churn prediction are:")
print(important_features.head(5))

Logistic Regression - Accuracy: 0.8106956933270232, Precision: 0.68125, Recall: 0.5696864111498258, F1: 0.6204933586337761
Random Forest - Accuracy: 0.7903454803596782, Precision: 0.6617283950617284, Recall: 0.46689895470383275, F1: 0.5474974463738509
Optimized Random Forest - Accuracy: 0.8021769995267393, Precision: 0.6733333333333333, Recall: 0.5278745644599303, F1: 0.591796875

Important Features:
TotalCharges        0.204281
MonthlyCharges      0.193535
tenure              0.172171
Contract            0.141922
OnlineSecurity      0.082919
TechSupport         0.060253
PaymentMethod       0.048733
InternetService     0.045566
PaperlessBilling    0.025386
OnlineBackup        0.025232
dtype: float64

Based on the evaluation metrics, Logistic Regression is the best-performing model on this test split.
The top features of the Optimized Random Forest for customer churn prediction are:
TotalCharges      0.204281
MonthlyCharges    0.193535
tenure            0.172171
Contract          0.141922
OnlineSecurity    0.082919
dtype: float64

This code provides a compact solution to the Customer Churn assignment. It loads the data, preprocesses it, trains different models, optimizes the hyperparameters, selects the most important features, and presents the results.

Remember that in a real-world scenario, it's crucial to explore the data and models in more detail and interpret the results accordingly. Additionally, it is recommended to try other advanced classification techniques or address class imbalance issues if applicable to your dataset.
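
One lightweight way to address class imbalance without resampling is the `class_weight='balanced'` option that many scikit-learn classifiers accept. It weights errors inversely to class frequency, which typically trades some precision for recall. A sketch on synthetic data (not the Telco dataset; all names below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 25% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

recalls = {}
for label, weight in (("default", None), ("balanced", "balanced")):
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, model.predict(X_te))
    print(f"class_weight={weight}: recall={recalls[label]:.3f}")
```

Comparing the two recall values, alongside precision, gives a first indication of whether reweighting is worthwhile before turning to resampling.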
