The following topics are covered in this notebook:

* Downloading a real-world dataset from Kaggle
* Exploratory data analysis and visualization
* Splitting a dataset into training, validation and test sets
* Filling/imputing missing values in numeric columns
* Scaling numeric features to a (0,1) range
* Encoding categorical columns as one-hot vectors
* Training a logistic regression model using Scikit-learn
* Evaluating a model using a validation set and test set
* Saving a model to disk and loading it back

In [1]:
import os
import zipfile
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import opendatasets as od
import pandas as pd
import requests

In [2]:
# # Set up our dataset path

# data_path = Path("data/")
# dataset_path = data_path / "weather_Austrailia"

# # If dataset path does not exist, make it
# if dataset_path.is_dir():
#   print(f"{dataset_path} already exists, skipping creation")

# else:
#   print(f"{dataset_path} does not exist, creating one...")
#   dataset_path.mkdir(parents = True, exist_ok = True)

# # Download dataset from github
# with open(data_path / "archive.zip", "wb") as f:
#   request = requests.get(("https://github.com/Musawer-Afzal/ML-with-Scikit-Learn-in-Python/raw/refs/heads/master/Datasets/archive.zip"))
#   print("Downloading Austrailia Weather Dataset")
#   f.write(request.content)

# # Unzip the dataset if zip file
# with zipfile.ZipFile(data_path / "archive.zip", "r") as zip_ref:
#   print("Unzipping Weather dataset")
#   zip_ref.extractall(dataset_path)

In [3]:
# df = pd.read_csv("/content/data/weather_Austrailia/weatherAUS.csv")

In [5]:
# df

In [6]:
# df.info()

In [7]:
# df.describe()

## Classification Problems

It is important to know the difference between classification problems and regression problems.

Problems where each input must be assigned a discrete category (also called a label or class) are known as *classification* problems.

A classification problem can be binary (yes/no) or multiclass (picking one of many classes).

## Regression Problems

Problems where a continuous numeric value must be predicted for each input are known as *regression* problems.

For example:

* Medical Charges Prediction
* House Price Prediction
* Ocean Temperature Prediction
* Weather Temperature Prediction
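
As a rough illustration of the distinction, the sketch below guesses the problem type from the target column's data type. The helper `guess_problem_type` and its rule (float targets mean regression, everything else means classification) are assumptions made for this example, not a general rule: an integer target, for instance, could be either. The toy values only mimic the kind of columns found in a weather dataset.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def guess_problem_type(target: pd.Series) -> str:
    """Hypothetical heuristic: float targets -> regression, otherwise classification."""
    values = target.dropna()  # ignore missing labels when inspecting the dtype
    if is_float_dtype(values):
        return "regression"
    return "classification"


# Toy targets (illustrative values only)
rain_tomorrow = pd.Series(["No", "Yes", "No", None])  # discrete categories
max_temp = pd.Series([22.9, 25.1, None, 28.0])        # continuous numeric values

print(guess_problem_type(rain_tomorrow))
print(guess_problem_type(max_temp))
```

In practice the decision comes from what the prediction means, so a heuristic like this is only a first check.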

### Logistic Regression for Solving Classification Problems

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model:

* we take a linear combination (or weighted sum) of the input features
* we apply the sigmoid function to the result to obtain a number between 0 and 1
* this number represents the probability of the input being classified as "Yes"
* instead of **RMSE**, the cross-entropy loss function is used to evaluate the results
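
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how Scikit-learn implements logistic regression internally; the function names, the toy inputs, and the weights and bias are all made up for the example (in a real model the weights are learned from data).

```python
import numpy as np


def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(X, weights, bias):
    """Weighted sum of the input features, followed by the sigmoid."""
    return sigmoid(X @ weights + bias)


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average cross-entropy loss for 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid taking log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


# Toy data: 3 inputs with 2 features each (illustrative values only)
X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.5]])
weights = np.array([1.0, -2.0])  # assumed weights, not learned
bias = 0.5
y_true = np.array([0, 0, 1])     # 1 stands for "Yes", 0 for "No"

probs = predict_proba(X, weights, bias)
loss = binary_cross_entropy(y_true, probs)
print(probs, loss)
```

A lower loss means the predicted probabilities agree more closely with the true labels, which is the quantity training tries to minimize.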

### Downloading the Data

We'll use the `opendatasets` library to download the data from Kaggle directly within Jupyter. Let's install it and check the version of the module imported above as `od`.

In [8]:
!pip install opendatasets --upgrade --quiet

In [9]:
od.version()

'0.1.22'

Now we can download the dataset using `od.download`. When we execute `od.download`, we'll be asked to provide our Kaggle username and API key.

In [10]:
dataset_url = 'https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package'

In [11]:
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: musawerafzal
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package
Downloading weather-dataset-rattle-package.zip to ./weather-dataset-rattle-package


100%|██████████| 3.83M/3.83M [00:00<00:00, 731MB/s]

In [12]:
data_dir = '/content/weather-dataset-rattle-package'

In [13]:
os.listdir(data_dir)

['weatherAUS.csv']

In [14]:
train_csv = data_dir + '/weatherAUS.csv'

In [15]:
train_csv

'/content/weather-dataset-rattle-package/weatherAUS.csv'

In [16]:
raw_df = pd.read_csv(train_csv)

In [17]:
raw_df

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145455,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,SE,...,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No
145456,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,...,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No
145457,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,SE,...,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No
145458,2017-06-24,Uluru,7.8,27.0,0.0,,,SE,28.0,SSE,...,51.0,24.0,1019.4,1016.5,3.0,2.0,15.1,26.0,No,No


In [18]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

In [19]:
raw_df.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,143975.0,144199.0,142199.0,82670.0,75625.0,135197.0,143693.0,142398.0,142806.0,140953.0,130395.0,130432.0,89572.0,86102.0,143693.0,141851.0
mean,12.194034,23.221348,2.360918,5.468232,7.611178,40.03523,14.043426,18.662657,68.880831,51.539116,1017.64994,1015.255889,4.447461,4.50993,16.990631,21.68339
std,6.398495,7.119049,8.47806,4.193704,3.785483,13.607062,8.915375,8.8098,19.029164,20.795902,7.10653,7.037414,2.887159,2.720357,6.488753,6.93665
min,-8.5,-4.8,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,980.5,977.1,0.0,0.0,-7.2,-5.4
25%,7.6,17.9,0.0,2.6,4.8,31.0,7.0,13.0,57.0,37.0,1012.9,1010.4,1.0,2.0,12.3,16.6
50%,12.0,22.6,0.0,4.8,8.4,39.0,13.0,19.0,70.0,52.0,1017.6,1015.2,5.0,5.0,16.7,21.1
75%,16.9,28.2,0.8,7.4,10.6,48.0,19.0,24.0,83.0,66.0,1022.4,1020.0,7.0,7.0,21.6,26.4
max,33.9,48.1,371.0,145.0,14.5,135.0,130.0,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7
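
The non-null counts reported by `raw_df.info()` and the `count` row of `raw_df.describe()` show that several columns have missing values. One way to make this easier to read is to compute the fraction of missing values per column. The sketch below uses a small made-up DataFrame (`toy_df`, with illustrative values) so that it runs on its own; the same expression could be applied to `raw_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values are illustrative only)
toy_df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "Sunshine": [np.nan, np.nan, np.nan, 8.1],
    "RainToday": ["No", "No", "Yes", None],
})

# isna() marks missing cells as True; the column mean of those booleans
# is the fraction of missing values in each column
missing_fraction = toy_df.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns near the top of this listing are the ones where imputing or dropping values will matter most.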
