# ML with Python: Course Introduction

MTU Spring 26

Instructor: Amna Mazen

## Learning outcomes
By the end of this tutorial, you will be able to:

- Run Python code in Google Colab.
- Interpret tabular data, including features, examples, and targets.
- Use Pandas to load and inspect datasets.
- Train simple supervised machine learning models for classification and regression.

**Attention!!**  
You are welcome to ask questions by raising your hand.  

There is also [a reflection Google Document](https://docs.google.com/document/d/1_IDMz9UJOQ1XZid259TLgQHRuwpALu8oImAGGiUpbRE/edit?usp=sharing) for this course where you can share your questions, comments, and reflections. It would be great if you could write about your takeaways, struggle points, and general comments in this document, so I can address them in the next lecture.


## What is ML?
<!-- <img src="img/ml_stat.jpg" height="1000" width="1000">  -->



Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions.

### Prevalence of ML

Let's look at some examples.


![Machine Learning Examples](https://raw.githubusercontent.com/MazenLabMTU/EET4501---SP25/main/notebooks/img/ml-examples.png)



- Image sources
    - [Voice assistants](https://geeksfl.com/blog/best-voice-assistant/)
    - [Google News](https://news.google.com)    
    - [Recommendation systems](https://en.wikipedia.org/wiki/Recommender_system)
    - [Face Recognition source](https://startupleague.online/blog/3dss-tech-facial-recognition-technology/)
    - [Auto-completion](https://9to5google.com/2020/08/10/android-11-autofill-keyboard/)
    - [Stock market prediction](https://hbr.org/2019/12/what-machine-learning-will-mean-for-asset-managers)    
    - [Character recognition](https://en.wikipedia.org/wiki/Handwriting_recognition)    
    - [AlphaGo](https://deepmind.com/alphago-china)
    - [Self-driving cars](https://mc.ai/artificial-intelligence-in-self-driving-cars%E2%80%8A-%E2%80%8Ahow-far-have-we-gotten/)
    - [Drug discovery](https://www.nature.com/articles/d41586-018-05267-x)
    - [Cancer detection](https://venturebeat.com/2018/10/12/google-ai-claims-99-accuracy-in-metastatic-breast-cancer-detection/)

### Saving time and scaling products

- Imagine writing a program for spam identification, i.e., deciding whether an email is spam or not.
- Traditional programming
    - Come up with rules using human understanding of spam messages.
    - This is time-consuming, and it is hard to come up with a robust set of rules.
- Machine learning
    - Collect a large amount of spam and non-spam emails and let the machine learning algorithm figure out the rules.
- With machine learning, you're likely to
    - Save time
    - Customize and scale products
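
To make the traditional-programming side of this contrast concrete, here is a minimal sketch of a hand-written spam rule. The keyword list and the three messages are made up for illustration; they are assumptions, not part of any real spam filter.

```python
# Hypothetical hand-picked keywords (an assumption for illustration only).
SPAM_KEYWORDS = {"free", "win", "prize", "urgent"}


def rule_based_is_spam(message: str) -> bool:
    """Flag a message as spam if it contains any hand-picked keyword."""
    words = {word.strip(".,!?").lower() for word in message.split()}
    return bool(words & SPAM_KEYWORDS)


messages = [
    "Free entry to win a prize!",        # spam, caught by the rules
    "Feel free to call me after class",  # not spam, but flagged because of "free"
    "Claim your rew4rd now",             # spam, missed because no keyword matches
]
for message in messages:
    print(rule_based_is_spam(message), "-", message)
```

The second and third messages show why such rules are brittle: every new trick or harmless use of a keyword needs another manual rule, whereas a learning algorithm can pick up such patterns from labeled data.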

### Types of machine learning

Here are some typical learning problems.

- **Supervised learning** ([Gmail spam filtering](https://support.google.com/a/answer/2368132?hl=en))
    - Training a model from input data and its corresponding targets to predict targets for new examples.     
- **Unsupervised learning** ([Google News](https://news.google.com/))
    - Training a model to find patterns in a dataset, typically an unlabeled dataset.
- **Reinforcement learning** ([AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far))
    - A family of algorithms for finding suitable actions to take in a given situation in order to maximize a reward.


## Supervised machine learning

### What is supervised machine learning (ML)?

- Training data comprises a set of observations ($X$) and their corresponding targets ($y$).
- We wish to find a model function $f$ that relates $X$ to $y$.
- We use the model function to predict targets of new examples.

![Machine Learning Examples](https://raw.githubusercontent.com/MazenLabMTU/EET4501---SP25/main/notebooks/img/sup-learning.png)
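
As a minimal sketch of these three steps, the snippet below fits a model on a tiny made-up dataset and predicts the target of a new example. The data, the feature meanings, and the choice of a decision tree are assumptions for illustration; any scikit-learn classifier follows the same `fit`/`predict` pattern.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): each row is [hours_studied, hours_slept].
X = [[1, 4], [2, 5], [8, 7], [9, 8]]
y = ["fail", "fail", "pass", "pass"]

# Learn a model function f that relates X to y.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Use the model function to predict the target of a new example.
prediction = model.predict([[10, 8]])
print(prediction[0])  # pass
```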



### Tabular data
In supervised machine learning, the input data is typically organized in a **tabular** format, where rows are **examples** and columns are **features**. One of the columns is typically the **target**.

**Features**
: Features are relevant characteristics of the problem, usually suggested by experts. Features are typically denoted by $X$ and the number of features is usually denoted by $d$.  

**Target**
: The target is the column we want to predict (typically denoted by $y$).

**Example**
: A row of feature values. When people refer to an example, it may or may not include the target corresponding to the feature values, depending upon the context. The number of examples is usually denoted by $n$.

**Training**
: The process of learning the mapping between the features ($X$) and the target ($y$).
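
The terminology above maps directly onto a Pandas DataFrame. The following sketch uses a tiny made-up table (the column names and values are assumptions for illustration) to separate features from the target and to read off $n$ and $d$.

```python
import pandas as pd

# Made-up toy table: 3 examples (rows), 2 features, and 1 target column.
toy_df = pd.DataFrame(
    {
        "age": [40, 33, 24],
        "total_bilirubin": [14.5, 0.7, 0.7],
        "target": ["Disease", "Disease", "No Disease"],
    }
)

X = toy_df.drop(columns=["target"])  # features
y = toy_df["target"]                 # target
n, d = X.shape                       # n examples, d features
print(n, d)  # 3 2
```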

#### Alternative terminology for examples, features, targets, and training

- **examples** = rows = samples = records = instances
- **features** = inputs = predictors = explanatory variables = regressors = independent variables = covariates
- **targets** = outputs = outcomes = response variable = dependent variable = labels (if categorical).
- **training** = learning = fitting

## Example 1: Predict whether a message is spam or not (Classification)

#### Input features $X$ and target $y$

> **Note:**  
> Do not worry about the code and syntax for now.


Download the SMS Spam Collection Dataset from [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset).


### Import Necessary Libraries

In [None]:
import glob
import os
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#import graphviz
import IPython
#import mglearn
from IPython.display import HTML, display
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz

plt.rcParams["font.size"] = 16
pd.set_option("display.max_colwidth", 200)

### Training a supervised machine learning model with $X$ and $y$

In [None]:
##Option1: Upload to Google Colab (temporary)
#file_path = '/content/spam.csv'

##Option2: Upload to Google Drive
#from google.colab import drive
#drive.mount('/content/drive')
#file_path = '/content/drive/MyDrive/MTU Teaching Courses/Spring 2025/Applied Machine Learning/week_1/Notebooks/spam.csv'

##Option3: GitHub
file_path = 'https://raw.githubusercontent.com/MazenMTULab/ML_COURSE_RESOURCES/main/Dataset/Lecture_Dataset/spam.csv'

sms_df = pd.read_csv(file_path, encoding="latin-1")
HTML(sms_df.head().to_html(index=False))

v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",,,
ham,Ok lar... Joking wif u oni...,,,
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,,,
ham,U dun say so early hor... U c already then say...,,,
ham,"Nah I don't think he goes to usf, he lives around here though",,,


In [None]:

sms_df = sms_df.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
HTML(train_df.head().to_html(index=False))

target,sms
spam,"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."
ham,"Aight, I'll hit you up when I get some cash"
ham,Don no da:)whats you plan?
ham,Going to take your babe out ?
ham,No need lar. Jus testing e phone card. Dunno network not gd i thk. Me waiting 4 my sis 2 finish bathing so i can bathe. Dun disturb u liao u cleaning ur room.


In [None]:
sms_df.shape

(5572, 2)

In [None]:
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]

clf = Pipeline(
    [
        ("vect", CountVectorizer(max_features=5000)),
        ("clf", LogisticRegression(max_iter=5000)),
    ]
)
clf.fit(X_train, y_train);

### Predicting on unseen data using the trained model

In [None]:
pd.DataFrame(X_test[0:4])

Unnamed: 0,sms
3245,"Funny fact Nobody teaches volcanoes 2 erupt, tsunamis 2 arise, hurricanes 2 sway aroundn no 1 teaches hw 2 choose a wife Natural disasters just happens"
944,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one m..."
1044,"We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p"
2484,Only if you promise your getting out as SOON as you can. And you'll text me in the morning to let me know you made it in ok.


In [None]:
pd.DataFrame(y_test[0:4])

Unnamed: 0,target
3245,ham
944,ham
1044,spam
2484,ham


In [None]:
pred_dict = {
    "sms": X_test[0:4],
    "spam_predictions": clf.predict(X_test[0:4]),
}
pred_df = pd.DataFrame(pred_dict)
pred_df.style.set_properties(**{"text-align": "left"})

Unnamed: 0,sms,spam_predictions
3245,"Funny fact Nobody teaches volcanoes 2 erupt, tsunamis 2 arise, hurricanes 2 sway aroundn no 1 teaches hw 2 choose a wife Natural disasters just happens",ham
944,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones",ham
1044,"We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p",spam
2484,Only if you promise your getting out as SOON as you can. And you'll text me in the morning to let me know you made it in ok.,ham


**We have accurately predicted labels for the unseen text messages above!**
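
Four examples are only a spot check. A more systematic check is to compute the accuracy on a whole test set and compare it with a trivial baseline. The sketch below does this on a tiny made-up dataset (not the course data), assuming a `DummyClassifier` that always predicts the most frequent class as the baseline.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up messages for illustration only.
train_sms = [
    "win a free prize now",
    "free cash prize call now",
    "see you at lunch",
    "are you home yet",
    "call me when you are home",
    "lunch at noon tomorrow",
]
train_y = ["spam", "spam", "ham", "ham", "ham", "ham"]
test_sms = ["free prize call now", "see you at home"]
test_y = ["spam", "ham"]

# Baseline: always predict the most frequent training class ("ham").
baseline = DummyClassifier(strategy="most_frequent").fit(train_sms, train_y)
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_sms, train_y)

baseline_acc = baseline.score(test_sms, test_y)  # fraction of correct predictions
model_acc = model.score(test_sms, test_y)
print("baseline accuracy:", baseline_acc)
print("model accuracy:", model_acc)
```

With the objects defined in this notebook, the analogous call would be `clf.score(X_test, y_test)`.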

### (Supervised) machine learning: popular definition
<blockquote>
A field of study that gives computers the ability to learn without being explicitly programmed. <br> -- Arthur Samuel (1959)
</blockquote>

ML is a different way to think about problem solving.


### Option 4: Get dataset from Kaggle
✅ Step 1 — Upload your Kaggle API key



1. Go to https://www.kaggle.com/ → Sign in
2. Click on your profile → Settings → Account
3. Scroll to API → Click **Create API Legacy Key**. This will download kaggle.json
4. In Google Colab, upload the kaggle.json file to the current working directory (for example, through the Files panel in the left sidebar).

✅ Step 2 — Configure Kaggle in Colab

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

✅ Step 3 — Download the Dataset

You can now download the SMS Spam dataset using the Kaggle API. The command in the next cell downloads the dataset from Kaggle directly to your machine or Google Colab environment.

Breakdown of each part:

- **kaggle:** Invokes the Kaggle command-line tool (Kaggle API).
- **datasets:** Specifies that you are working with Kaggle datasets (not competitions or notebooks).
- **download:** Tells Kaggle you want to download a dataset.
- **-d:** Introduces the dataset identifier.
- **uciml/sms-spam-collection-dataset:** The unique Kaggle ID of the dataset, where `uciml` is the dataset owner and `sms-spam-collection-dataset` is the dataset name.

In [None]:
!kaggle datasets download -d uciml/sms-spam-collection-dataset

Dataset URL: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
License(s): unknown
Downloading sms-spam-collection-dataset.zip to /content
  0% 0.00/211k [00:00<?, ?B/s]
100% 211k/211k [00:00<00:00, 517MB/s]


✅ Step 4 — Unzip and Load into Pandas

In [None]:
!unzip sms-spam-collection-dataset.zip

Archive:  sms-spam-collection-dataset.zip
  inflating: spam.csv                


In [None]:
file_path = '/content/spam.csv'


sms_df = pd.read_csv(file_path, encoding="latin-1")
sms_df = sms_df.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
HTML(train_df.head().to_html(index=False))

target,sms
spam,"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."
ham,"Aight, I'll hit you up when I get some cash"
ham,Don no da:)whats you plan?
ham,Going to take your babe out ?
ham,No need lar. Jus testing e phone card. Dunno network not gd i thk. Me waiting 4 my sis 2 finish bathing so i can bathe. Dun disturb u liao u cleaning ur room.


## Example 2: Predicting whether a patient has liver disease or not (Classification)


##### Input data

Suppose we are interested in predicting whether a patient has the disease or not. We are given some tabular data with inputs and outputs of liver patients, as shown below. The data contains a number of input features and a special column called "Target" which is the output we are interested in predicting.


You can download the data from the [Indian Liver Patient Records dataset](https://www.kaggle.com/uciml/indian-liver-patient-records). You don't need to download it, since we will load it from the GitHub repo.


In [None]:
file_path = 'https://raw.githubusercontent.com/MazenMTULab/ML_COURSE_RESOURCES/main/Dataset/Lecture_Dataset/indian_liver_patient.csv'
df = pd.read_csv(file_path)
df = df.drop(columns = ["Gender"])
df["Dataset"] = df["Dataset"].replace(1, "Disease")
df["Dataset"] = df["Dataset"].replace(2, "No Disease")
df.rename(columns={"Dataset": "Target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=4, random_state=42)
HTML(train_df.head().to_html(index=False))

Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Target
40,14.5,6.4,358,50,75,5.7,2.1,0.5,Disease
33,0.7,0.2,256,21,30,8.5,3.9,0.8,Disease
24,0.7,0.2,188,11,10,5.5,2.3,0.71,No Disease
60,0.7,0.2,171,31,26,7.0,3.5,1.0,No Disease
18,0.8,0.2,199,34,31,6.5,3.5,1.16,No Disease
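
Note that `test_size` in `train_test_split` can be either a fraction or an absolute count. The sketch below illustrates the difference on 100 made-up rows (a stand-in list, not the liver data).

```python
from sklearn.model_selection import train_test_split

toy_rows = list(range(100))  # stand-in for 100 examples

# An integer test_size is the absolute number of test examples.
train_a, test_a = train_test_split(toy_rows, test_size=4, random_state=42)

# A float test_size between 0 and 1 is the proportion of test examples.
train_b, test_b = train_test_split(toy_rows, test_size=0.10, random_state=42)

print(len(train_a), len(test_a))  # 96 4
print(len(train_b), len(test_b))  # 90 10
```

In the cell above, `test_size=4` therefore holds out exactly four patients as unseen examples.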


### Building a supervised machine learning model

Let's train a supervised machine learning model with the input and output above.

In [None]:
from lightgbm.sklearn import LGBMClassifier

X_train = train_df.drop(columns=["Target"])
y_train = train_df["Target"]
X_test = test_df.drop(columns=["Target"])
y_test = test_df["Target"]
model = LGBMClassifier(random_state=123)
model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 166, number of negative: 413
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000111 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 678
[LightGBM] [Info] Number of data points in the train set: 579, number of used features: 9
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.286701 -> initscore=-0.911460
[LightGBM] [Info] Start training from score -0.911460


### Model predictions on unseen data

- Given the features of the new patients below, we'll use this model to predict whether these patients have liver disease or not.

In [None]:
HTML(X_test.reset_index(drop=True).to_html(index=False))

Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
19,1.4,0.8,178,13,26,8.0,4.6,1.3
12,1.0,0.2,719,157,108,7.2,3.7,1.0
60,5.7,2.8,214,412,850,7.3,3.2,0.78
42,0.5,0.1,162,155,108,8.1,4.0,0.9


In [None]:
y_test

Unnamed: 0,Target
355,No Disease
407,Disease
90,Disease
402,Disease


In [None]:
pred_df = pd.DataFrame({"Predicted_target": model.predict(X_test).tolist()})

df_concat = pd.concat([pred_df, X_test.reset_index(drop=True)], axis=1)
HTML(df_concat.to_html(index=False))

Predicted_target,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
No Disease,19,1.4,0.8,178,13,26,8.0,4.6,1.3
Disease,12,1.0,0.2,719,157,108,7.2,3.7,1.0
Disease,60,5.7,2.8,214,412,850,7.3,3.2,0.78
Disease,42,0.5,0.1,162,155,108,8.1,4.0,0.9


<br><br>

## Example 3: Predicting housing prices (Regression)

Suppose we want to predict housing prices given a number of attributes associated with houses.


You can download the data from the [House Price dataset](https://www.kaggle.com/harlfoxem/housesalesprediction). You don't need to download it, since we will load it from the GitHub repo.


In [None]:
file_path = 'https://raw.githubusercontent.com/MazenMTULab/ML_COURSE_RESOURCES/main/Dataset/Lecture_Dataset/kc_house_data.csv'
df = pd.read_csv(file_path)
df = df.drop(columns = ["id", "date"])
df.rename(columns={"price": "target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=4)
HTML(train_df.head().to_html(index=False))

target,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
509000.0,2,1.5,1930,3521,2.0,0,0,3,8,1930,0,1989,0,98007,47.6092,-122.146,1840,3576
675000.0,5,2.75,2570,12906,2.0,0,0,3,8,2570,0,1987,0,98075,47.5814,-122.05,2580,12927
420000.0,3,1.0,1150,5120,1.0,0,0,4,6,800,350,1946,0,98116,47.5588,-122.392,1220,5120
680000.0,8,2.75,2530,4800,2.0,0,0,4,7,1390,1140,1901,0,98112,47.6241,-122.305,1540,4800
357823.0,3,1.5,1240,9196,1.0,0,0,3,8,1240,0,1968,0,98072,47.7562,-122.094,1690,10800


In [None]:
# Build a regression model
from lightgbm.sklearn import LGBMRegressor

X_train, y_train = train_df.drop(columns= ["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns= ["target"]), test_df["target"]  # test targets must come from test_df, not train_df

model = LGBMRegressor()
#model = XGBRegressor()
model.fit(X_train, y_train);

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005777 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2333
[LightGBM] [Info] Number of data points in the train set: 17290, number of used features: 18
[LightGBM] [Info] Start training from score 539762.702545


In [None]:
y_test[0:4]


In [None]:
# Predict on unseen examples using the built model
pred_df = pd.DataFrame(
    # {"Predicted target": model.predict(X_test[0:4]).tolist(), "Actual price": y_test[0:4].tolist()}
    {"Predicted_target": model.predict(X_test[0:4]).tolist()}
)
df_concat = pd.concat([pred_df, X_test[0:4].reset_index(drop=True)], axis=1)
HTML(df_concat.to_html(index=False))

Predicted_target,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
345831.740542,4,2.25,2130,8078,1.0,0,0,4,7,1380,750,1977,0,98055,47.4482,-122.209,2300,8112
601042.018745,3,2.5,2210,7620,2.0,0,0,3,8,2210,0,1994,0,98052,47.6938,-122.13,1920,7440
311310.186024,4,1.5,1800,9576,1.0,0,0,4,7,1800,0,1977,0,98045,47.4664,-121.747,1370,9576
597555.592401,3,2.5,1580,1321,2.0,0,2,3,8,1080,500,2014,0,98107,47.6688,-122.402,1530,1357
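
For regression, predictions are rarely exactly right, so we usually measure how far off they are. As a rough sketch, the snippet below computes two common error measures, mean absolute error (MAE) and root mean squared error (RMSE), on made-up prices (these numbers are assumptions, not values from the housing data).

```python
# Made-up actual and predicted prices, for illustration only.
actual = [300000.0, 450000.0, 600000.0]
predicted = [310000.0, 430000.0, 630000.0]

errors = [p - a for p, a in zip(predicted, actual)]

# Mean absolute error: average size of the errors, in the target's units.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalizes large errors more heavily.
rmse = (sum(e ** 2 for e in errors) / len(errors)) ** 0.5

print("MAE:", mae)
print("RMSE:", round(rmse, 2))
```

With the objects in this notebook, the same idea applies to `model.predict(X_test)` and the true targets in `y_test`.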


To summarize, supervised machine learning can be used on a variety of problems and different kinds of data.

### Machine learning workflow

Here is a typical workflow of a supervised machine learning system.

![](img/ml-workflow.png)


We will build machine learning pipelines in this course, focusing on some of the steps above.

### Python requirements/resources

We will primarily use Python in this course.

Here is the basic Python knowledge you'll need for the course:

- Basic Python programming
- Numpy
- Pandas
- Basic matplotlib
- Sparse matrices

Quiz 1 is all about Python.


We do not have time to teach all the Python we need, but you can find some [useful Python resources](https://github.com/areyesan/SAT4310-F23).


## Summary

- Machine learning is a different paradigm for problem solving.    
- Very often it reduces the time you spend programming and helps you customize and scale your products.
- In supervised learning we are given a set of observations ($X$) and their corresponding targets ($y$) and we wish to find a model function $f$ that relates $X$ to $y$.


