<a href="https://www.kaggle.com/code/chriszhengao/gdsc-upm-ml-workshop-regression-prediction?scriptVersionId=155171880" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Welcome to the **GDSC UPM** *Machine Learning* Workshop!🎊🎊   
   
   ![image.png](attachment:5b6297d9-dd35-4d0c-93b8-aa3aee45a4cb.png)  
     
-----------------------
   
## This is going to be your very first step into the world of machine learning by building    
## "*Your First Prediction*"!
   
🎊This is a hands-on Python notebook that walks you through a very simple machine learning project.  
   
❓By following the questions, code, and dataset below, you will discover ML with us step by step.
   
⛵Don't worry too much, just follow along!

----------------------------------------------

## A little information about the platform... 🦤   
   
   ![image.png](attachment:dfd1c72e-a361-417b-b941-8f964e09ac05.png)

🖥️***Kaggle*** is one of the most famous data science and machine learning platforms in the world.   
   
🧑‍💻Plenty of *data scientists* and *developers* share their solutions there, along with code, datasets, and their ways of handling data.   
   
✍🏻It's a good place to explore on your own and improve your skills by absorbing other people's ideas!   
   
🌁Moreover, with its **online development environment**, we can use online resources directly to **code**!   
> *This Python 3 environment comes with many helpful analytics libraries installed   
It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python*   
   
> *You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"   
You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session*

-------------------------------------------------------------------

## Let's talk about the dataset we are going to play with today...🌀   
   
![image.png](attachment:99f3a7d5-4810-4fb8-b819-e4635c1f4717.png)
   
   
🔬Researchers, *Xiaoqin Lu, Hui Yu, Xiaoming Yang and Xiaofeng Li, 2017*, published a paper called:  

🛰️"*Estimating Tropical Cyclone Size in the Northwestern Pacific from Geostationary Satellite Infrared Images*"   
   
> As the title explains, they observed images from satellites and recorded the *size* of each tropical cyclone.   
   
### What is a *tropical cyclone*?   
   
> ☠️From 1998-2017, *storms*, including *tropical cyclones* and *hurricanes*, were second only to earthquakes in terms of fatalities, killing **233,000 people**.  (WHO, 2017)   
   
> 🌪️Typhoons can generate winds of more than **75 miles per hour (120 km/h)** and cause **flooding**, heavy **rainfall** and **storms**. (Erik Devaney, 2018)   
   
   
   
**That is why it matters to figure out how machine learning can predict the size of these climate disasters, so that we can help more people and prevent further losses!**

-------------------------------------------------------------------------------------

## Preview of our dataset by using Python🐍   

Since we are working in a notebook attached to the dataset that I previously uploaded and processed,   
   
the CSV file is listed in the right panel: in the **Notebook** section, under **Input**, click the ↓, and "*tropical_cyclone_size.csv*" is our dataset file!   
> *Input data files are available in the read-only "../input/" directory*   
   
![image.png](attachment:46265dae-0db6-4269-a216-c256f8a8c699.png)

Read the dataset by clicking **Run** on the code block below   
   ![image.png](attachment:597440f7-0ee5-4b8d-8603-5e79674d6902.png)   
      
**If you are new, don't forget to read the comments in the code block!**

In [1]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

![image.png](attachment:faa23f1f-d7f0-4249-8f3c-27fefbfd1ff1.png)

In [2]:
# read the CSV file with pandas; hover over the file in the Input panel and a button pops up to copy its path
df = pd.read_csv('/kaggle/input/tropical-cyclone-size-in-the-northwestern-pacific/tropical_cyclone_size.csv', index_col=0)
# head() previews the first 5 rows of the dataset
df.head()

| Cyclone Number | Time | Latitude | Longitude | Pressure | Wind Speed | SiR34 | SATSer |
|----------------|------|----------|-----------|----------|------------|-------|--------|
| 8003 | 1980-04-06 06:00:00 | 15.97 | 177.2 | 987 | 27.5 | 182.9 | GOE-3 |
| 8003 | 1980-04-06 18:00:00 | 17.9 | 178.28 | 987 | 25.9 | 140.8 | GOE-3 |
| 8003 | 1980-04-07 00:00:00 | 18.75 | 179.01 | 992 | 22.2 | 111.6 | GOE-3 |
| 8003 | 1980-04-07 06:00:00 | 19.4 | 179.68 | 994 | 22.1 | 125.5 | GOE-3 |
| 8015 | 1980-08-08 12:00:00 | 14.15 | 157.04 | 996 | 17.6 | 139.8 | GOE-3 |


### Explanation of the column names   
   
| Field          | Description                                                           |
|----------------|-----------------------------------------------------------------------|
| Cyclone Number | Unique international ID number for each tropical cyclone              |
| Time           | YYYY-MM-DD HH:MM:SS                                                   |
| Latitude       | Latitude of the tropical cyclone center                               |
| Longitude      | Longitude of the tropical cyclone center                              |
| Pressure       | Minimum central pressure of the tropical cyclone                      |
| Wind Speed     | Maximum sustained wind speed near the tropical cyclone center         |
| SiR34          | Scale of the tropical cyclone (km, based on the 34-knot wind radius)  |
| SATSer         | Satellite used for inversion, including GOES-1 to 13                  |
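
As a small sketch of how the `Time` strings could be turned into real timestamps, the snippet below parses the first three preview rows with `pd.to_datetime`. The `sample` frame and the `parsed` name are made up for illustration, and the format string is an assumption based on how the values look in the preview; the workshop itself does not use `Time` as a feature.

```python
import pandas as pd

# First three rows of the preview, retyped by hand for illustration
sample = pd.DataFrame({
    "Time": ["1980-04-06 06:00:00", "1980-04-06 18:00:00", "1980-04-07 00:00:00"],
    "SiR34": [182.9, 140.8, 111.6],
})

# Parse the text column into pandas timestamps, matching the strings in the preview
parsed = pd.to_datetime(sample["Time"], format="%Y-%m-%d %H:%M:%S")

# Once parsed, parts of the date are easy to pull out
print(parsed.dt.year.tolist())
print(parsed.dt.hour.tolist())
```

With real timestamps you could, for example, count observations per year or per month if you wanted to explore seasonality on your own.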

------------------------------------

## *Remember, it's always very important in ML to understand your data!*   
   
I have already done the data preprocessing for you.   
If you are interested in that step,   
please refer to the notebook I uploaded earlier:  
[Data Processing for Tropical Cyclone Size Dataset](https://www.kaggle.com/code/chriszhengao/data-processing-for-tropical-cyclone-size-dataset)

In [3]:
# find out how many rows we have in our dataset
df.shape[0]

14997

In [4]:
# find out whether any cells in the dataset contain NULL values
df.isnull().sum().sum()

0

* *What else do you want to try?*    
   
* *Want to check whether something is wrong with your dataset?*   
   
* *Use any method or online resource to find out, and test it yourself by simply adding code blocks below!*   
   

**Here are some simple example questions for you to try**:   
   
1. *Mean?*   
2. *Std. Deviation?*   
3. *Max?*   
4. *Min?*   
5. *Median?*   
...

In [5]:
# For example
# mean_value = df["SiR34"].mean()
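
If you would like a starting point for the five questions above, here is a minimal sketch that computes them in one call. It only uses the `SiR34` values from the five preview rows (the `sizes` and `stats` names are made up for this example), so the numbers describe that tiny sample, not the full dataset.

```python
import pandas as pd

# SiR34 values from the five preview rows (not the full dataset)
sizes = pd.Series([182.9, 140.8, 111.6, 125.5, 139.8], name="SiR34")

# One call returns all five summary statistics at once
stats = sizes.agg(["mean", "std", "max", "min", "median"])
print(stats)
```

On the notebook's real data, the same idea should work as `df["SiR34"].agg([...])`, and `df.describe()` gives a similar summary for every numeric column.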

### *Visualization is usually a better way to gain a deeper understanding of your data*   
   
For each column, please refer to the data card on: [Tropical Cyclone Size in the Northwestern Pacific Dataset Kaggle Page](https://www.kaggle.com/datasets/chriszhengao/tropical-cyclone-size-in-the-northwestern-pacific)   
   
![image.png](attachment:bfbb5d00-aeb8-49b8-b25d-414e47711d2a.png)   
   
*For beginners, exploring the dataset in the data card gives a comprehensive understanding of it.*   
Kaggle lets users hover over the bars to show more information,   
as well as the general trend of each column.   
   
   
![image.png](attachment:e38aa70e-04bc-46ff-9fdb-75965900073f.png)   
   
**Thankfully, we don't have any mismatched or missing values!**

#### **For example**   
   
Most **Latitude** values fall in the range **17.97 - 20.32**   
   
Most **Longitude** values fall in the range **124.11 - 128.08**   
   
*Which means this area experienced the most tropical cyclones, right?*   
### *🗺️Let's find out where it is on the map.🗺️*

In [6]:
# folium is a map visualization library in Python
import folium

# location sets the map center, and zoom_start sets the initial zoom level
maps = folium.Map(location=[19, 126], zoom_start=5)

# we want to show the area with the most cyclone records, so we add circles to display it
folium.Circle(
    location=[17.97, 124.11], # lower bound of the latitude range with lower bound of the longitude range
    radius=400000, # in meters, an arbitrary number for display purposes only, we will estimate the size later on
    color='red',
    fill=True,
    fill_color='red',
    fill_opacity=0.2
).add_to(maps) # add to the map

# second circle
folium.Circle(
    location=[20.32, 128.08], # upper bound of the latitude range with upper bound of the longitude range
    radius=400000, # in meters, an arbitrary number for display purposes only, we will estimate the size later on
    color='red',
    fill=True,
    fill_color='red',
    fill_opacity=0.2
).add_to(maps) # add to the map

maps # display the map with both circles and any other components attached to it

---------------------------------

## 🏃*Now, shall we begin to build our Machine Learning model?*

### Machine learning can be categorized into three main types:

> 1. **Supervised Learning**: *In supervised learning, the algorithm is trained on a labeled dataset, where input data and their corresponding target outputs are provided. The goal is to learn a mapping from inputs to outputs, making it suitable for tasks like classification and regression.*

> 2. **Unsupervised Learning**: *Unsupervised learning involves working with unlabeled data. The algorithm attempts to find patterns, structures, or groupings within the data without specific guidance. Common techniques include clustering for grouping similar data and dimensionality reduction to simplify data representation.*

> 3. **Reinforcement Learning**: *Reinforcement learning is used for training agents to make sequential decisions in an environment to maximize a cumulative reward. It involves an agent that learns by interacting with the environment, receiving feedback in the form of rewards or penalties, and adjusting its actions to improve its performance.*

### Two common tasks in machine learning   
   
> 1. **Regression** *is a type of supervised machine learning used for predicting a continuous numerical output.*    
   
> 2. **Classification** *is another type of supervised machine learning used for assigning input data to predefined categories or classes.*

### ⁉️**QUESTION**:    *What type of machine learning will we be using today*?   


### *Why regression?*   
   
> 1. **Quantitative Results**: *Regression models provide quantitative estimates, allowing you to make precise size predictions. This is particularly important when dealing with matters like disaster preparedness and risk assessment.*  
   
> 2. **Data-Driven**: *Cyclone size prediction often relies on historical data and meteorological measurements. Regression models can leverage this data to make predictions and are capable of incorporating additional factors, such as climate data or atmospheric conditions.*

## 🖖Now, we shall split our dataset into:   
   
1. **Testing set**: for testing our model.  
2. **Training set**: for training our model.  
  
  


In [7]:
# we don't need the time or the Cyclone Number as features to train our model, otherwise they could cause "overfitting"
X = df[['Latitude','Longitude', 'Pressure', 'Wind Speed']] # features
y = df[['SiR34']] # target

Check our features and targets.

In [8]:
X.sample(5)

| Cyclone Number | Latitude | Longitude | Pressure | Wind Speed |
|----------------|----------|-----------|----------|------------|
| 514 | 24.85 | 132.15 | 935 | 48.2 |
| 430 | 9.58 | 110.23 | 985 | 26.0 |
| 24 | 29.05 | 141.42 | 982 | 28.2 |
| 8715 | 21.28 | 155.71 | 917 | 57.1 |
| 10 | 31.3 | 141.83 | 991 | 20.6 |


In [9]:
y.sample(5)

| Cyclone Number | SiR34 |
|----------------|-------|
| 1318 | 179.4 |
| 9235 | 260.8 |
| 1116 | 201.7 |
| 8631 | 150.9 |
| 8214 | 265.6 |


In [10]:
# library from sklearn to split the dataset into train & test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert the target from a column vector to a one-dimensional array using ravel()
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
# X_train: features for training
# X_test: features for testing
# y_train: targets for training
# y_test: targets for testing

# test_size: fraction of rows held out for testing (0.3 gives a 3:7 test-to-train split)
# random_state: seed for the random shuffle, so the split is the same on every run


# print out the shapes of our split dataset
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(10497, 4)
(4500, 4)
(10497,)
(4500,)
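
The shapes above can be checked with a little arithmetic. This is only a sketch of the calculation: it assumes the test share is rounded up to a whole row and everything else goes to training, which is consistent with the printed shapes. The variable names are made up for this example.

```python
import math

n_rows = 14997      # total rows reported by df.shape[0]
test_size = 0.3     # fraction passed to train_test_split

# Assumption: the test share is rounded up and the rest goes to training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```

The two counts add up to the original number of rows, so no row is lost or used twice.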


-----------------

# 🏗️*Model building*

# *1. Random Forest🌲*   
> *Random forest regression is an ensemble learning technique that combines multiple decision tree regressors to make predictions. It works by averaging the predictions of these individual trees to reduce overfitting and provide more accurate and robust predictions. Random forests are versatile and well-suited for both simple and complex regression tasks.*

![](https://miro.medium.com/v2/resize:fit:1400/1*jE1Cb1Dc_p9WEOPMkC95WQ.png)
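
To make the averaging idea concrete, here is a minimal sketch on made-up data (the names `X_demo`, `y_demo`, and `forest` are invented for this example and are not part of the workshop dataset). It asks each tree in a small forest for its prediction and checks that the mean of those answers matches what the forest itself returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data: 200 rows, 4 features, with a nonlinear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = 100 * X_demo[:, 0] + 50 * X_demo[:, 1] ** 2 + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X_demo, y_demo)

# Ask every tree separately, then average their answers by hand
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction should match the hand-made average
print(np.allclose(manual_average, forest.predict(X_demo[:5])))
```

Individual trees can disagree quite a lot on a single row; averaging them is what smooths out those differences.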

In [11]:
# sklearn library includes algorithm models.
from sklearn.ensemble import RandomForestRegressor
# define the algorithm
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
# train the model by using our training sets
rf_regressor.fit(X_train, y_train)
# predict the result according to our test features data
rf_predict = rf_regressor.predict(X_test)

In [12]:
# create a new DataFrame that includes all the test features
rf_result = pd.DataFrame(X_test, columns=X_test.columns)

# add y_test
rf_result['Size_test'] = y_test

# add the prediction
rf_result['Size_pred'] = rf_predict

# show the result
print(rf_result)

                Latitude  Longitude  Pressure  Wind Speed  Size_test  \
Cyclone Number                                                         
1123               17.75     113.02       986        24.3      175.9   
9428               37.97     148.88       972        29.1      228.5   
304                32.15     132.25       986        23.1      189.4   
8219               19.02     129.52       972        28.9      160.3   
9326               21.75     126.08       990        21.7      307.8   
...                  ...        ...       ...         ...        ...   
1121               19.83     111.05       967        33.7      207.8   
919                18.53     111.63       994        18.9      157.7   
9309               18.95     106.75       985        23.9      155.2   
9122               22.98     130.33       980        26.0      147.0   
806                18.38     132.90       957        39.7      182.8   

                Size_pred  
Cyclone Number             
1123   

# *2. Linear Regression🔏*    
> *Linear regression is a simple and interpretable regression technique that models the relationship between one or more input variables and a continuous target variable. It assumes a linear relationship between the inputs and the target, represented by a straight line in simple linear regression or a hyperplane in multiple linear regression. The goal is to find the best-fitting line that minimizes the sum of squared errors.*

![](https://images.shiksha.com/mediadata/ugcDocuments/images/wordpressImages/2022_04_linear-Regression-1.jpg)
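
As a rough illustration of "minimizing the sum of squared errors", the sketch below fits a line with plain NumPy instead of scikit-learn. The data is made up from a known formula (`y = 5 + 2*x1 + 3*x2`) so that you can see the fit recover the weights; `X_demo`, `y_demo`, `A`, and `weights` are names invented for this example.

```python
import numpy as np

# Made-up inputs with two features, plus a target built from a known line:
# y = 5 + 2*x1 + 3*x2 (no noise, so the fit can recover it)
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y_demo = 5 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

# Add a column of ones so the intercept is learned like any other weight
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least squares: find the weights that minimize the sum of squared errors
weights, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
print(weights.round(3))
```

The first weight is the intercept and the other two are the slopes for each feature. Real data is noisy, so the fitted weights there are a best compromise rather than an exact recovery.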

In [13]:
# the sklearn library includes algorithm models.
from sklearn.linear_model import LinearRegression
# define the algorithm
linear_regressor = LinearRegression()
# train the model by using our training sets
linear_regressor.fit(X_train, y_train)
# predict the result according to our test features data
lr_predict = linear_regressor.predict(X_test)

In [14]:
# create a new DataFrame with all the test features
lr_result = pd.DataFrame(X_test, columns=X_test.columns)

# add y_test
lr_result['Size_test'] = y_test

# add the prediction
lr_result['Size_pred'] = lr_predict

# show the whole result
print(lr_result)

                Latitude  Longitude  Pressure  Wind Speed  Size_test  \
Cyclone Number                                                         
1123               17.75     113.02       986        24.3      175.9   
9428               37.97     148.88       972        29.1      228.5   
304                32.15     132.25       986        23.1      189.4   
8219               19.02     129.52       972        28.9      160.3   
9326               21.75     126.08       990        21.7      307.8   
...                  ...        ...       ...         ...        ...   
1121               19.83     111.05       967        33.7      207.8   
919                18.53     111.63       994        18.9      157.7   
9309               18.95     106.75       985        23.9      155.2   
9122               22.98     130.33       980        26.0      147.0   
806                18.38     132.90       957        39.7      182.8   

                 Size_pred  
Cyclone Number              
1123 

# *3. Neural Network🧠*    
> *Neural network regression involves using artificial neural networks (ANNs) for regression tasks. ANNs consist of interconnected layers of nodes (neurons) and are capable of modeling complex, non-linear relationships between input and output variables. Neural network regression is highly flexible and can capture intricate patterns in data, making it suitable for a wide range of regression problems, including those with non-linear dependencies.*

![b254bcbb2dc933f832a07ce51629e9f.png](attachment:c85a7078-333a-410d-8c01-eb349f71a582.png)

In [15]:
# We use TensorFlow, together with Keras, as our deep learning framework.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# define the model
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),  # input layer, sized to the number of training features
    layers.Dense(8, activation='relu'),      # hidden layer 1, the size could be modified
    layers.Dense(4, activation='relu'),      # hidden layer 2
    layers.Dense(1)                          # output layer
])

# compile the model.
model.compile(optimizer='adam', loss='mean_squared_error')
# epochs: number of passes over the training set; batch_size: rows per gradient update
model.fit(X_train, y_train, epochs=10, batch_size=32) # using our training sets
# prediction, flattened from shape (n, 1) to 1-D so it lines up with y_test
nn_predict = model.predict(X_test).ravel()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
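
To get a feel for how small this network is, the sketch below counts its trainable parameters by hand. It assumes 4 input features (the columns in `X`) and the layer widths 8, 4, and 1 from the model definition; the variable names are made up for this example.

```python
# Layer widths of the network above: 4 inputs -> 8 -> 4 -> 1 output
layer_sizes = [4, 8, 4, 1]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_in * n_out + n_out   # weights plus one bias per neuron
    print(f"{n_in} -> {n_out}: {params} parameters")
    total += params

print("total:", total)
```

Calling `model.summary()` in the notebook should report the same total, which is a handy way to check that the model was built the way you intended.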


In [16]:
# create a new DataFrame with all the test features
nn_result = pd.DataFrame(X_test, columns=X_test.columns)

# add y_test
nn_result['Size_test'] = y_test

# add on the NN prediction
nn_result['Size_pred'] = nn_predict

# show the result
print(nn_result)


                Latitude  Longitude  Pressure  Wind Speed  Size_test  \
Cyclone Number                                                         
1123               17.75     113.02       986        24.3      175.9   
9428               37.97     148.88       972        29.1      228.5   
304                32.15     132.25       986        23.1      189.4   
8219               19.02     129.52       972        28.9      160.3   
9326               21.75     126.08       990        21.7      307.8   
...                  ...        ...       ...         ...        ...   
1121               19.83     111.05       967        33.7      207.8   
919                18.53     111.63       994        18.9      157.7   
9309               18.95     106.75       985        23.9      155.2   
9122               22.98     130.33       980        26.0      147.0   
806                18.38     132.90       957        39.7      182.8   

                 Size_pred  
Cyclone Number              
1123 

## ✍️*Now we have trained all three simple models..*   
## 🧪*We attached the predictions to the results, but that alone is not enough to **evaluate** the performance!*

### 🧮Since the target is numerical, real-world size data, we will use:    

> ***Mean Absolute Error, MAE***  
> *MAE is the mean of the absolute errors between the actual observations and the model predictions; smaller values of MAE are better.*
  
$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$   
   
$n$ means *sample size*   
$y_i$ means *actual value*   
$\hat{y}_i$ means *predicted value from the model*
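
Here is the formula worked through by hand on three values. The actual sizes are the first three test rows printed earlier; the predictions are hypothetical numbers invented only to keep the arithmetic easy to follow.

```python
# First three actual sizes from the test set shown earlier (km)
actual = [175.9, 228.5, 189.4]
# Hypothetical predictions, invented only to make the arithmetic easy to follow
predicted = [165.9, 238.5, 189.4]

# Absolute error for each pair, then the mean of those errors
abs_errors = [abs(y - y_hat) for y, y_hat in zip(actual, predicted)]
mae = sum(abs_errors) / len(abs_errors)

print(round(mae, 2))
```

The errors are roughly 10, 10, and 0 km, so the mean is about 6.67 km. Note that an overestimate and an underestimate of the same size count equally, because the sign is dropped.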

In [17]:
# import the library from sklearn to calculate the MAE
from sklearn.metrics import mean_absolute_error

# Calculate the MAE value for each model

# compare our test targets with the predictions from each model
# random forest
rf_mae = mean_absolute_error(y_test, rf_predict)
# linear regression
lr_mae = mean_absolute_error(y_test, lr_predict)
# neural networks
nn_mae = mean_absolute_error(y_test, nn_predict)

# print out the results in KM, the unit of SiR34 described earlier
print("MAE of Random Forest: ", rf_mae, "KM")
print("MAE of Linear Regression: ", lr_mae, "KM")
print("MAE of Neural Networks: ", nn_mae, "KM")

MAE of Random Forest:  26.718619999999998 KM
MAE of Linear Regression:  27.922280280283484 KM
MAE of Neural Networks:  28.326878078545462 KM


## 🤔*Pretty close, huh?*
Every model and algorithm has the **potential** to perform better,  
   
It's now up to you to discover how to **optimize** your models later in your machine learning journey!


-----------------------------------

# *🗺️Now, shall we test our results on the map again?*

## 🤌*Let's randomly pick one row of data* 
*as our test tropical cyclone*   
*and this one will remain on the map,*   

*while the results from our models will be shown in different colors.*

In [18]:
# sample one row from the original df (random_state fixes which row is picked)
row = df.sample(n=1, random_state=42)
# keep only the feature columns, ready for model prediction
random_row = row[['Latitude','Longitude', 'Pressure', 'Wind Speed']]
# print it out
print(random_row)

                Latitude  Longitude  Pressure  Wind Speed
Cyclone Number                                           
1123               17.75     113.02       986        24.3


In [19]:
# latitude, longitude, size of chosen row
random_lat = row['Latitude'].values[0].tolist()
random_lon = row['Longitude'].values[0].tolist()
random_size = row['SiR34'].values[0].tolist()

## 🌀*Now, the **blue** circle is the original recorded data, we set it as the standard*.

In [20]:
# location sets the map center, and zoom_start sets the initial zoom level
test_maps = folium.Map(location=[random_lat, random_lon], zoom_start=6)


# put the data into the circle
# test circle
folium.Circle(
    location=[random_lat, random_lon], # lat and lon from chosen row
    radius=random_size * 1000, # convert km to meters, the unit the radius is given in
    color='blue',
    fill=True,
    fill_color='blue',
    fill_opacity=0.2
).add_to(test_maps) # add to the maps

<folium.vector_layers.Circle at 0x7db153a117e0>

## *After that, we shall put our predictions on the map*🌪️   
   
0. 🔵**Blue Circle**: for the recorded size of the test cyclone
1. 🟢**Green Circle**: for the prediction from *Random Forest*  
2. 🟡**Yellow Circle**: for the prediction from *Linear Regression*
3. 🔴**Red Circle**: for the prediction from *Neural Network*

### *Size prediction of our chosen one...*

In [21]:
rf_size = float(rf_regressor.predict(random_row)[0])
lr_size = float(linear_regressor.predict(random_row)[0])
# the Keras model returns a 2-D array of shape (1, 1), so index both dimensions
nn_size = float(model.predict(random_row)[0][0])
# predictions from each model, followed by the chosen row's recorded size
print(rf_size, lr_size, nn_size, random_size)

156.37099999999998 162.03583067178272 161.8043670654297 175.9
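
To read those four numbers more easily, the sketch below turns them into one absolute error per model. The values are copied from the output above and rounded to three decimals; the dictionary and variable names are made up for this example.

```python
actual_size = 175.9  # recorded SiR34 of the chosen row (km)

# Predictions printed above, rounded to three decimals
predictions = {
    "Random Forest": 156.371,
    "Linear Regression": 162.036,
    "Neural Network": 161.804,
}

# Absolute error of each model on this single row
errors = {name: abs(actual_size - value) for name, value in predictions.items()}
for name, err in errors.items():
    print(f"{name}: off by {err:.1f} km")

closest = min(errors, key=errors.get)
print("closest:", closest)
```

On this one row, linear regression lands closest, even though random forest had the lowest MAE over the whole test set. A single row is not enough to rank models, which is why we averaged the error over all test rows earlier.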


### 🟢*Add Random Forest's tropical cyclone size prediction to the map*

In [22]:
# random forest
folium.Circle(
    location=[random_lat, random_lon], # lat and lon from chosen row
    radius=rf_size * 1000, # convert km to meters, the unit the radius is given in
    color='green',
    fill=True,
    fill_color='green',
    fill_opacity=0.2
).add_to(test_maps) # add to the maps

<folium.vector_layers.Circle at 0x7db153a13fa0>

### 🟡*Add Linear Regression's tropical cyclone size prediction to the map*

In [23]:
# linear regression
folium.Circle(
    location=[random_lat, random_lon], # lat and lon from chosen row
    radius=lr_size * 1000, # convert km to meters, the unit the radius is given in
    color='yellow',
    fill=True,
    fill_color='yellow',
    fill_opacity=0.2
).add_to(test_maps) # add to the maps

<folium.vector_layers.Circle at 0x7db0f84eaf80>

### 🔴*Add Neural Network's tropical cyclone size prediction to the map*

In [24]:
# neural network
folium.Circle(
    location=[random_lat, random_lon], # lat and lon from chosen row
    radius=nn_size * 1000, # convert km to meters, the unit the radius is given in
    color='red',
    fill=True,
    fill_color='red',
    fill_opacity=0.2
).add_to(test_maps) # add to the maps

<folium.vector_layers.Circle at 0x7db153a114e0>

### 🖹*Finally, show the map to check our models' performance!*

In [25]:
test_maps

----------------------------------------------------

# *What about deploying our model on the web?*   
   
   
[Size Prediction - Web](https://tropical-cyclone-size.streamlit.app/)

![image.png](attachment:1c2988c4-87c2-4811-89cd-dbd62eafeab6.png)

# 🥳***At the end***

**You have done the workshop!**      
**You did a great job!**  🥂

*WELCOME TO THE WORLD OF MACHINE LEARNING!*

🎉Thank you for participating in this workshop from Google Developer Student Clubs - Machine Learning Department. 🎉  
   
I really hope you enjoyed this notebook and the code,   
   
and more importantly, gained some knowledge!
   
Feel free to ask questions and leave comments, they are highly appreciated!   
       
Looking forward to seeing you in the next ML workshop!
   
ZHENG AO