## **BEGINNER'S TUTORIAL ON SCIKIT-LEARN**

In this tutorial, we will explore some common tasks that can be accomplished using scikit-learn, a popular machine learning package in Python. Scikit-learn is known for its simplicity and efficiency in handling various machine learning algorithms. We will cover the following topics:

1. Loading a dataset
2. Inspecting the dataset and finding missing values
3. Splitting the dataset into training, validation, and test sets
4. Training different classification and regression models
5. Evaluating model performance using various metrics

By the end of this tutorial, you will have a good understanding of how to use scikit-learn to build and evaluate machine learning models. Let's get started!


### Package Installation & Importation

In [104]:
#execute this cell to install the required packages (if not done already)
%pip install scikit-learn numpy pandas

Note: you may need to restart the kernel to use updated packages.


#### Install Kaggle API
You need to have the Kaggle API installed. You can install it using pip:

In [105]:
import os
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


#### Set Up Kaggle API Credentials
1. Go to Kaggle's website and sign up.
2. Go to "My Account" (click on your profile picture in the top right corner and then on "My Account").
3. Go to "Settings"
4. Scroll down to the "API" section and click on "Create New API Token". This will download a file called kaggle.json.
5. Place the kaggle.json file in the .kaggle directory in your home directory. You can do this with the following commands:
```sh
    mkdir -p ~/.kaggle
    mv /path/to/kaggle.json ~/.kaggle/
    chmod 600 ~/.kaggle/kaggle.json
```
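
As a quick sanity check after running the commands above, a small Python helper can confirm that the token file exists and is readable only by you. This is a minimal sketch: `check_kaggle_credentials` is a hypothetical helper written for this tutorial (it is not part of the Kaggle package), and the permission check assumes a Unix-like system.

```python
import stat
from pathlib import Path


def check_kaggle_credentials(config_dir: Path = Path.home() / ".kaggle") -> str:
    """Return a short status string describing the kaggle.json file.

    Hypothetical helper for this tutorial, not part of the Kaggle package.
    """
    token = config_dir / "kaggle.json"
    if not token.is_file():
        return f"missing: {token}"
    # Keep only the permission bits (for example 0o600) from the file mode.
    mode = stat.S_IMODE(token.stat().st_mode)
    # Any bit set for "group" or "others" means the file is too open.
    if mode & 0o077:
        return f"too permissive ({oct(mode)}): run chmod 600 {token}"
    return "ok"


print(check_kaggle_credentials())
```

The helper only inspects the file on disk; it does not contact Kaggle or validate the token itself.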

#### Joining Relevant Competition on Kaggle
1. Make sure you have logged into kaggle with the same account the API key has been generated for.
2. Go to https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques and click on "Join Competition".
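3. Once you have joined, the Kaggle API can download the regression dataset (House Price Prediction). The regression example in this notebook itself loads scikit-learn's built-in California housing dataset, so it runs even without this step.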

# Classification

#### Load the Dataset
The breast cancer dataset ships with scikit-learn, so no download is needed. Load it directly:

In [106]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset as a pandas DataFrame
data = load_breast_cancer(as_frame=True)
df = pd.concat([data.data, data.target.rename('target')], axis=1)

print("First 5 rows of the dataset:")
df.head()  # not shown in the output: Jupyter only renders the last expression of a cell

print("\nClass distribution (0 = malignant, 1 = benign):")
print(df['target'].value_counts())


First 5 rows of the dataset:

Class distribution (0 = malignant, 1 = benign):
target
1    357
0    212
Name: count, dtype: int64
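
The class counts above also tell us how well a trivial model would do, which is a useful reference point when reading accuracy scores later. The sketch below is an illustration added for this purpose, not part of the original workflow: it uses scikit-learn's `DummyClassifier` to always predict the most frequent class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Share of each class: 357 benign (1) and 212 malignant (0) out of 569 rows.
class_share = y.value_counts(normalize=True)
print(class_share.round(3))

# A classifier that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Since 357 of the 569 samples are benign, the baseline accuracy is 357/569, or about 0.627. Any real model should clearly beat that figure.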


In [107]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


#### 1. Loading a dataset
The breast cancer dataset, provided by scikit-learn, is a widely used dataset in the field of machine learning and data science. This dataset contains measurements of various features of cell nuclei present in breast cancer biopsies. It is commonly used for binary classification tasks to distinguish between malignant (cancerous) and benign (non-cancerous) tumors.

In [108]:
data = load_breast_cancer(as_frame=True)
df = pd.concat([data.data, data.target.rename('target')], axis=1)

print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [109]:
# Display basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

#### 2. Exploring the dataset and checking for missing values


In [110]:
# Display summary statistics
print("Display summary statistics: \n",df.describe())

Display summary statistics: 
        mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000    

In [111]:
# Check for missing values
print("missing values:\n", df.isnull().sum())

missing values:
 mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
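
The breast cancer dataset has no missing values, so nothing needs fixing here. For datasets that do have gaps, one common option is to fill them with a column statistic. The sketch below shows this on a small made-up DataFrame (the values are hypothetical) using scikit-learn's `SimpleImputer` with the median strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps; the breast cancer dataset itself has none.
toy = pd.DataFrame({
    "radius": [14.1, np.nan, 12.3, 20.6],
    "texture": [19.3, 17.8, np.nan, 21.2],
})
print(toy.isnull().sum())

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_filled)
```

In a full workflow, the imputer would normally be fitted on the training split only and then applied to the validation and test splits, so that no information from held-out data influences preprocessing.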


### Data Preprocessing and Splitting

In [112]:
# Features (X) and target (y)
X = df.drop('target', axis=1)  
y = df['target']   # already 0 (malignant) / 1 (benign)

print("Target values (0 = malignant, 1 = benign):")
print(y.head(10))


Target values (0 = malignant, 1 = benign):
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: target, dtype: int64


In [113]:
from sklearn.model_selection import train_test_split

# Split the data into training, validation, and test sets (approximately 70%, 10%, 20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.66, random_state=42)

# Check the shapes of the splits
X_train.shape, X_val.shape, X_test.shape

((398, 30), (58, 30), (113, 30))
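
Because the two classes are not equally frequent, a purely random split can shift the class ratio slightly between subsets. A possible variation, sketched below as an assumption rather than something the notebook does, is to pass `stratify` to `train_test_split` so each subset keeps roughly the same share of benign cases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# First hold out 30% of the rows, keeping the class ratio intact.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
# Then give two thirds of the held-out rows to the test set (about 20% overall).
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=2 / 3, stratify=y_temp, random_state=42
)

# The target is 0/1, so its mean is the share of benign (1) samples.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: {len(part)} rows, benign share = {part.mean():.3f}")
```

The subset sizes from this variation may differ by a row or two from the split used in the rest of the notebook, so the metrics reported below would not be reproduced exactly.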

#### 3. Training different classification models
In this section, we will demonstrate how to initialize and train different classification models using scikit-learn. While we won't go into the detailed workings of these models, it's important to know that there are multiple algorithms available for classification tasks.


In [114]:
# Initialize and train a Logistic Regression Model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

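LogisticRegression(max_iter=10000)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 10000 |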


In [115]:
# Initialize and train DecisionTree Model
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

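DecisionTreeClassifier(random_state=42)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | 42 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |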


In [116]:
# Initialize and train SVM Model
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)

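SVC(random_state=42)

| Parameter | Value |
|---|---|
| C | 1.0 |
| kernel | 'rbf' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | False |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | None |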


If you wish to look at the predictions of each model separately, try executing `model_name.predict(X_val)`, replacing `model_name` with `log_reg`, `tree_clf`, or `svm_clf`.

Comparing these predictions with `y_val` gives better insight into how each model is performing.
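
One way to compare predictions with the true labels is a confusion matrix, which counts each combination of true and predicted class. The sketch below rebuilds the same split and decision tree as above so that it runs on its own; the use of `confusion_matrix` is an addition for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# Same split settings as in the notebook.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.66, random_state=42)

tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
val_pred = tree_clf.predict(X_val)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_val, val_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"Malignant tumours predicted as benign: {fp}")
print(f"Benign tumours predicted as malignant: {fn}")
```

Here the positive class is 1 (benign), so a "false positive" is a malignant tumour predicted as benign. In a medical setting that is usually the more costly kind of error.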

#### 4. Visualizing the metrics for each model

In [117]:
# Summary of performance metrics
metrics = {
    'Model': ['Logistic Regression', 'Decision Tree', 'SVM'],
    'Accuracy': [accuracy_score(y_val, log_reg.predict(X_val)),
                 accuracy_score(y_val, tree_clf.predict(X_val)),
                 accuracy_score(y_val, svm_clf.predict(X_val))],
    'Precision': [precision_score(y_val, log_reg.predict(X_val)),
                  precision_score(y_val, tree_clf.predict(X_val)),
                  precision_score(y_val, svm_clf.predict(X_val))],
    'Recall': [recall_score(y_val, log_reg.predict(X_val)),
               recall_score(y_val, tree_clf.predict(X_val)),
               recall_score(y_val, svm_clf.predict(X_val))],
    'F1-Score': [f1_score(y_val, log_reg.predict(X_val)),
                 f1_score(y_val, tree_clf.predict(X_val)),
                 f1_score(y_val, svm_clf.predict(X_val))]
}

metrics_df = pd.DataFrame(metrics)
metrics_df

| | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.965517 | 0.944444 | 1.0 | 0.971429 |
| 1 | Decision Tree | 0.931034 | 0.96875 | 0.911765 | 0.939394 |
| 2 | SVM | 0.913793 | 0.871795 | 1.0 | 0.931507 |


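Based on the validation metrics, the **Logistic Regression** model appears to be the best fit for this data. It achieved the highest accuracy (0.966) and F1-score (0.971) and, like the SVM, a recall of 1.0. The Decision Tree had slightly higher precision (0.969 versus 0.944), but its lower recall (0.912) pulls its F1-score below that of Logistic Regression.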
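
The comparison above uses the validation set, and the test set has not been touched yet. A natural last step, sketched here as a suggestion rather than as part of the original notebook, is to evaluate the chosen model once on the test set. The block repeats the data loading and split so that it runs independently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# Same split settings as in the notebook.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.66, random_state=42)

log_reg = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Evaluate once on data that played no part in training or model selection.
test_pred = log_reg.predict(X_test)
report = classification_report(y_test, test_pred, target_names=["malignant", "benign"])
print(report)
```

Keeping the test set for a single final check gives a less optimistic estimate than the validation scores, because the validation set was already used to pick the model.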

# Regression

In [118]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
california = fetch_california_housing(as_frame=True)
df = california.frame

print("Dataset shape:", df.shape)

Dataset shape: (20640, 9)


In [119]:
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [120]:
# Define the target variable and features
target = 'MedHouseVal'
features = df.drop(columns=[target])
y = df[target]

print("Features shape:", features.shape)
print("Target shape:", y.shape)
features.head()

Features shape: (20640, 8)
Target shape: (20640,)


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [121]:
# Drop rows with missing target values
df = df.dropna(subset=[target])
df.isnull().sum()

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

In [122]:
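# Drop columns in which every value is NaN
# (without how='all', dropna would drop any column containing at least one NaN)
df = df.dropna(axis=1, how='all')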
df.isnull().sum()

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64

In [123]:
X = df.drop(columns=[target])
y = df[target]

# Identify numerical columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Split the data into training and testing sets before scaling,
# so that no test-set statistics leak into the preprocessing step
X_train, X_test, y_train, y_test = train_test_split(X[numerical_features], y, test_size=0.2, random_state=42)

In [124]:
# Preprocess the data: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |
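
Scaling parameters should be learned from the training rows only and then reused on the test rows. One common way to bundle the scaler and the model so this happens automatically is a scikit-learn `Pipeline`. The sketch below is illustrative and uses synthetic data from `make_regression` as a stand-in, so it runs without downloading the housing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs without downloading anything.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit() learns the scaling from the training data only;
# predict() and score() reuse those statistics on new data.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

test_r2 = pipe.score(X_test, y_test)
print(f"R² on the held-out split: {test_r2:.3f}")
```

The same pipeline object could be fitted on the California housing features in place of the synthetic arrays. For plain linear regression the scaling does not change the predictions, but it matters for many other models.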


In [125]:
# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred[:10]

array([0.71912284, 1.76401657, 2.70965883, 2.83892593, 2.60465725,
       2.01175367, 2.64550005, 2.16875532, 2.74074644, 3.91561473])

In [126]:
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [127]:
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

Mean Squared Error (MSE): 0.56
R-squared (R²): 0.58
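
To make the two metrics less of a black box, the sketch below computes them by hand with NumPy and compares the results with scikit-learn's functions. The four target values and predictions are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values, not taken from the housing data.
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Every residual here is ±0.5, so the MSE is 0.25. R² compares the model's squared error with that of always predicting the mean, which is why a value of 0.58 on the housing data means the model removes a bit more than half of that baseline error.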


In [128]:
if r2 > 0.8:
    print("The model explains a high proportion of the variance in house prices, suggesting a strong fit.")
elif r2 > 0.5:
    print("The model explains a moderate proportion of the variance in house prices, indicating a reasonable fit.")
else:
    print("The model explains a low proportion of the variance in house prices, indicating that it may not fit the data well.")

The model explains a moderate proportion of the variance in house prices, indicating a reasonable fit.


## That's the end of this notebook, hope you had a fun learning experience!