# BEGINNER'S TUTORIAL ON SCIKIT-LEARN

In this tutorial, we will explore some common tasks that can be accomplished using scikit-learn, a popular machine learning package in Python. Scikit-learn is known for its simplicity and efficiency in handling various machine learning algorithms. We will cover the following topics:

1. Loading a dataset
2. Splitting the dataset into training and test sets
3. Training different classification and regression models
4. Finding missing values in the dataset
5. Evaluating model performance using various metrics

By the end of this tutorial, you will have a good understanding of how to use scikit-learn to build and evaluate machine learning models. Let's get started!


## Package Installation & Importation

In [1]:
#execute this cell to install the required packages (if not done already)
# ! pip install scikit-learn numpy pandas kaggle seaborn matplotlib

#### Set Up Kaggle API Credentials
1. Go to Kaggle's website and sign up.
2. Go to "My Account" (click on your profile picture in the top right corner and then on "My Account").
3. Go to "Settings"
4. Scroll down to the "API" section and click on "Create New API Token". This will download a file called kaggle.json.
5. Place the kaggle.json file in the .kaggle directory in your home directory. You can do this with the following commands:
```sh
    mkdir -p ~/.kaggle
    mv /path/to/kaggle.json ~/.kaggle/
    chmod 600 ~/.kaggle/kaggle.json
```

In [92]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

# Classification

#### 1. Loading a dataset
The breast cancer dataset is widely used in machine learning and data science, and a version of it also ships with scikit-learn. Here we load it from a local CSV file. It contains measurements of cell nuclei features taken from breast cancer biopsies and is commonly used for binary classification: distinguishing malignant (cancerous) from benign (non-cancerous) tumors.

In [124]:
data = "classification-ds.csv"
df = pd.read_csv(data)

# Display the first and last few rows of the dataset
df

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,
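
For readers who do not have the CSV file at hand, the following is a minimal sketch of loading what appears to be the same Wisconsin diagnostic data directly from scikit-learn. It assumes `load_breast_cancer` is an acceptable substitute for the file used here. The bundled version uses different column names, has no `id` or `Unnamed: 32` columns, and encodes the target the other way round (0 = malignant, 1 = benign), so results may not match this tutorial exactly.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
bunch = load_breast_cancer(as_frame=True)
X_sk = bunch.data      # 569 rows, 30 numeric feature columns
y_sk = bunch.target    # 0 = malignant, 1 = benign in this bundled version

print(X_sk.shape)
print(list(bunch.target_names))
```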


In [125]:
# Display basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [126]:
# Display summary statistics
df.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,


In [127]:
# Check for missing values
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:
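
The `Unnamed: 32` column has a `count` of 0 in the summary statistics, meaning it is entirely empty, which is why it is dropped in the next step. As a small illustration on made-up data (the frame and its column names are hypothetical), the per-column counts can be filtered so that only columns with gaps are listed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete column, one partly missing, one empty
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, np.nan],
})

missing = toy.isnull().sum()          # missing values per column
only_missing = missing[missing > 0]   # keep columns with at least one gap
all_missing = toy.columns[toy.isnull().all()].tolist()  # entirely empty columns

print(only_missing)
print(all_missing)
```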

### Data Preprocessing and Splitting

In [128]:
#Drop columns which are not required
df = df.drop(['Unnamed: 32', 'id'], axis = 1)

# X contains the input attributes, y contains the target/output values
X = df.drop('diagnosis', axis=1)  
y = df['diagnosis']     

# Convert 'diagnosis' column to binary
y = y.map({'M': 1, 'B': 0})

y

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64
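
The manual `map` above could also be expressed with `LabelEncoder`, which is already imported at the top of the notebook. The sketch below uses a short hypothetical list of labels rather than the real column. `LabelEncoder` sorts the class names, so `'B'` becomes 0 and `'M'` becomes 1, the same encoding as the dictionary mapping.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M"]   # hypothetical sample of diagnosis values

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ are sorted, so 'B' -> 0 and 'M' -> 1, matching the manual mapping
print(list(encoder.classes_))
print(list(encoded))
```

An explicit dictionary keeps the choice of the positive class visible, which matters for precision and recall, so the manual mapping remains a reasonable choice here.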

#### 2. Splitting the dataset into training and test sets

In [129]:
# Split the data into training and testing sets (80%, 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the splits
X_train.shape, X_test.shape

((455, 30), (114, 30))
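
This tutorial uses a single train/test split. If a separate validation set is wanted, for example to compare models before touching the test data, one common approach is to call `train_test_split` twice. The sketch below uses hypothetical toy data and variable names; the 60/20/20 proportions and the use of `stratify` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, balanced binary labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0, 1] * 50)

# First hold out 20% as the test set, keeping class proportions (stratify)
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Then take 25% of the remainder as validation: 0.25 * 80% = 20% of the total
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```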

#### 3. Training different classification models
In this section, we will demonstrate how to initialize and train different classification models using scikit-learn. While we won't go into the detailed workings of these models, it's important to know that there are multiple algorithms available for classification tasks.


In [130]:
# Initialize and train a Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
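
The warning above means the solver stopped at its iteration limit before converging. The message itself suggests two remedies: scaling the data or raising `max_iter`. A minimal sketch combining both in a pipeline is shown below. It assumes scikit-learn's bundled copy of the breast cancer data stands in for the CSV, so the score it prints need not match the numbers reported later in this tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_sk, y_sk = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_sk, y_sk, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, then applied before the classifier
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

print(f"Test accuracy: {scaled_log_reg.score(X_te, y_te):.3f}")
```

Wrapping the scaler and the model in one pipeline means the same scaling is applied automatically whenever `predict` or `score` is called.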


In [131]:
# Initialize and train DecisionTree Model
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)

In [132]:
# Initialize and train SVM Model
svm_clf = SVC()
svm_clf.fit(X_train, y_train)

To inspect each model's predictions separately, run `model_name.predict(X_test)`, for example `log_reg.predict(X_test)`.

Comparing these predictions with `y_test` gives better insight into how the model is performing.
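
One way to compare predictions with the true labels for held-out data is to put them side by side in a DataFrame. The sketch below is self-contained and therefore retrains a decision tree on scikit-learn's bundled copy of the data. The names `clf` and `comparison` are hypothetical, and the numbers may differ from those in this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_sk, y_sk = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_sk, y_sk, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Put true and predicted labels next to each other for inspection
comparison = pd.DataFrame({"actual": y_te.to_numpy(), "predicted": clf.predict(X_te)})
comparison["correct"] = comparison["actual"] == comparison["predicted"]

print(comparison.head())
print(comparison["correct"].mean())   # fraction correct, i.e. the accuracy
```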

#### 4. Visualizing the metrics for each model

In [133]:
# Summary of performance metrics
metrics = {
    'Model': ['Logistic Regression', 'Decision Tree', 'SVM'],
    'Accuracy': [accuracy_score(y_test, log_reg.predict(X_test)),
                 accuracy_score(y_test, tree_clf.predict(X_test)),
                 accuracy_score(y_test, svm_clf.predict(X_test))],
    'Precision': [precision_score(y_test, log_reg.predict(X_test)),
                  precision_score(y_test, tree_clf.predict(X_test)),
                  precision_score(y_test, svm_clf.predict(X_test))],
    'Recall': [recall_score(y_test, log_reg.predict(X_test)),
               recall_score(y_test, tree_clf.predict(X_test)),
               recall_score(y_test, svm_clf.predict(X_test))],
    'F1-Score': [f1_score(y_test, log_reg.predict(X_test)),
                 f1_score(y_test, tree_clf.predict(X_test)),
                 f1_score(y_test, svm_clf.predict(X_test))]
}

metrics_df = pd.DataFrame(metrics)
metrics_df

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-Score
0,Logistic Regression,0.964912,0.97561,0.930233,0.952381
1,Decision Tree,0.947368,0.930233,0.930233,0.930233
2,SVM,0.947368,1.0,0.860465,0.925


Based on the performance metrics, **Logistic Regression** appears to be the best overall fit for this data. It achieved the highest accuracy (0.965) and F1-score (0.952), and tied with the Decision Tree for the highest recall (0.930). The SVM reached a higher precision (1.0), but at the cost of the lowest recall (0.860).
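
To make the four metrics concrete, the sketch below recomputes them from confusion-matrix counts. The counts are an inference, not something the notebook reports: 40 true positives, 1 false positive, 3 false negatives and 70 true negatives are the values consistent with the Logistic Regression row and the 114 test samples.

```python
# Confusion-matrix counts inferred from the Logistic Regression row above
# (an assumption: the tutorial reports only the ratios, not the counts)
tp, fp, fn, tn = 40, 1, 3, 70

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of the predicted positives, how many are right
recall = tp / (tp + fn)             # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.6f} {precision:.6f} {recall:.6f} {f1:.6f}")
```

Under these assumed counts the printed values agree with the table to six decimal places. In a medical setting a false negative (a missed malignant tumor) is usually the costlier error, which is one reason to look at recall and not only accuracy.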

# Regression

In [134]:
df = pd.read_csv("regression-ds.csv")
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [135]:
# Define the target variable and features
target = 'SalePrice'
features = df.drop(columns=[target])
features.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal


In [136]:
# Drop rows with missing target values
df = df.dropna(subset=[target])
df.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

In [137]:
# Drop every column that contains at least one NaN
df = df.dropna(axis=1)
df.isnull().sum()

Id               0
MSSubClass       0
MSZoning         0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 62, dtype: int64
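
Note that `dropna(axis=1)` removes a column as soon as it has a single missing value, which is why the frame shrinks from 81 to 62 columns and loses `LotFrontage` (259 missing values). The sketch below contrasts this default with `how="all"` on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "full": [1, 2, 3],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

any_dropped = toy.dropna(axis=1)             # default how='any'
all_dropped = toy.dropna(axis=1, how="all")  # only entirely empty columns go

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```

Dropping on any missing value is simple but can discard useful features. Filling the gaps (imputation) is a common alternative that this tutorial does not cover.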

In [138]:
X = df.drop(columns=[target])
y = df[target]

# Identify numerical columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Preprocess the data: scale numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X[numerical_features])
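
The cell above fits the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation. A commonly recommended alternative is to fit the scaler on the training rows only and reuse those statistics on the test rows. The sketch below shows that order on hypothetical toy data, and also checks what `StandardScaler` computes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 samples, 2 features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))
y_toy = X_toy @ np.array([3.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_tr_scaled, manual))
```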

In [139]:
# Split the data into training and testing sets (default: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

In [140]:
# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred

array([154007.87984183, 309582.76456264, 115031.07504977, 181001.39594481,
       301445.33570323,  46438.45407551, 227366.70122798, 149541.83310126,
        43500.65139432, 151855.01171511, 156845.64619036, 114084.71472898,
        82221.7697666 , 211031.07504977, 191124.96188144, 141818.82480421,
       215382.82480421, 136011.13529134, 117834.82480421, 233886.51431708,
       185896.19032896, 216766.26716462, 193919.50911313, 132166.45407551,
       215269.33570323, 153495.32220224, 200237.89334283,  90862.14358838,
       186342.08025372, 181711.13529134, 116702.63789332, 273065.7697666 ,
       230871.44577847,  89671.01171511, 270942.51431708, 164518.08025372,
       150102.57765175, 220310.20382996, 308373.08545768,  91180.52781808,
       133429.70643193, 253326.45407551, 103478.20382996, 278093.64619036,
       130917.27236857, 131841.57480421, 103605.08545768, 131812.06675272,
       359454.26716462, 128166.20382996, 106166.82480421, 219046.57765175,
        93918.39074085, 3

In [141]:
# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [142]:
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

Mean Squared Error (MSE): 1265846048.73
R-squared (R²): 0.82
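
The MSE is expressed in squared units of `SalePrice`, which makes it hard to read. Taking the square root gives the RMSE, in the same units as the target. The sketch below does this for the value printed above and then verifies the R² formula, 1 - SS_res / SS_tot, on a few hypothetical numbers.

```python
import math

import numpy as np
from sklearn.metrics import r2_score

mse_reported = 1265846048.73            # value printed above
rmse = math.sqrt(mse_reported)          # back in the units of SalePrice
print(f"RMSE: {rmse:,.0f}")

# R² on hypothetical toy values: 1 - SS_res / SS_tot
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])
ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

An RMSE of roughly 35,600 means a typical prediction error of that size in the units of `SalePrice`, which is easier to judge against the prices shown in the table than the raw MSE.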


In [143]:
if r2 > 0.8:
    print("The model explains a high proportion of the variance in house prices, suggesting a strong fit.")
elif r2 > 0.5:
    print("The model explains a moderate proportion of the variance in house prices, indicating a reasonable fit.")
else:
    print("The model explains a low proportion of the variance in house prices, indicating that it may not fit the data well.")

The model explains a high proportion of the variance in house prices, suggesting a strong fit.


# That's the end of this notebook, hope you had a fun learning experience!

## Tasks

1. Perform min-max scaling on this set of values: `[171, 120, 86, 176, 77]`. The input values range from 32 to 212 and the output should range from 0 to 100.

1. Perform standardization (z-score normalization) on a dataset with the following values: `[50, 60, 70, 80, 90]`. Ensure the transformed values have a mean of 0 and a standard deviation of 1.

1. Convert the categorical labels `['cat', 'dog', 'fish', 'cat', 'dog']` into numerical labels using label encoding.

1. Apply one-hot encoding to the categorical variable `['apple', 'banana', 'orange', 'banana', 'banana', 'apple', 'orange', 'orange']`.

1. Split the dataset `X = [[1], [2], [3], [4], [5], [6], [7], [8]]` and `y = [10, 20, 30, 40, 50, 60, 70, 80]` into training and testing sets with a test size of 25%.

1. Generate a confusion matrix for the true labels `[1, 0, 1, 1, 0]` and the predicted labels `[1, 0, 0, 1, 1]`.

1. Train a random forest model on the glass classification dataset and output the importance of each feature.

1. Train a support vector regression model on the possum dataset and predict the age of a possum.