# Final Project: Rain Prediction in Australia

Build models with:
1.  Linear Regression
2.  KNN
3.  Decision Trees
4.  Logistic Regression
5.  SVM

Metrics:
1. Accuracy Score
2. Jaccard Index
3. F1-Score
4. LogLoss
5. Mean Absolute Error
6. Mean Squared Error
7. R2-Score
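
The three classification metrics in this list can be checked by hand from confusion-matrix counts. The sketch below uses a made-up four-sample example (the arrays are illustrative assumptions, not project data) and compares the hand formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

# Toy labels, invented for illustration only
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)   # (tp + tn) / total
jac = jaccard_score(y_true, y_pred)    # tp / (tp + fp + fn)
f1 = f1_score(y_true, y_pred)          # 2*tp / (2*tp + fp + fn)

acc_manual = (tp + tn) / (tp + tn + fp + fn)
jac_manual = tp / (tp + fp + fn)
f1_manual = 2 * tp / (2 * tp + fp + fn)

print(acc, jac, f1)
```

Both the Jaccard index and the F1-score ignore true negatives, so a model that never predicts the positive class scores 0 on both regardless of its accuracy.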

### Step 1. Import Packages

In [9]:
import pandas as pd
import numpy as np

# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

# Preprocessing and data splitting
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Evaluation metrics
from sklearn.metrics import jaccard_score, f1_score, log_loss
from sklearn.metrics import confusion_matrix, accuracy_score, r2_score
import sklearn.metrics as metrics


### Step 2. Read in the data
- The dataset contains daily observations of weather metrics from 2008 to 2017
- [Download dataset here](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv)

In [12]:
df = pd.read_csv("/Users/pc/Desktop/IBM AI Engineer/Machine Learning with Python/Weather_Data.csv")
df.head()


Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


### Step 3. Data preprocessing

### One Hot Encoding: convert categorical variables to binary variables

In [13]:
df_sydney_processed = pd.get_dummies(data = df, columns = ["RainToday", "WindGustDir", "WindDir9am", "WindDir3pm"])

In [14]:
# Categorical column => numerical column
# Replace the remaining "No"/"Yes" values (the RainTomorrow column) with 0/1.
df_sydney_processed.replace(["No", "Yes"], [0, 1], inplace = True)
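
To see what the encoding step produces, here is a minimal sketch on a three-row toy frame. The rows are invented for illustration, and `map` is used for the target column as one possible alternative to `replace`; it is not what the notebook itself does.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# One indicator column per category; untouched columns come first
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"])

# Map the target explicitly so only this column is affected
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

# Cast everything to float so all columns share one numeric dtype
encoded = encoded.astype(float)
print(encoded.columns.tolist())
```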


### Step 4. Train and Test data

In [16]:
# Set the features (x values) and the target variable (y values)
df_sydney_processed.drop("Date", axis = 1, inplace = True)
df_sydney_processed = df_sydney_processed.astype(float)

features = df_sydney_processed.drop(columns = "RainTomorrow")
Y = df_sydney_processed["RainTomorrow"]

### Step 5. Model: Linear Regression

#### Q1. 
- Use the train_test_split function to split the features and Y dataframes, with test_size set to 0.2 and random_state set to 10

In [27]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size = 0.2, random_state = 10)
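
As a quick check of what `test_size` and `random_state` do, the sketch below splits a ten-row toy array (an assumption for illustration, not the weather data) and confirms that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state gives an identical split
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=10
)
```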


#### Q2. Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train)

In [28]:
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

#### Q3. Use the predict method on the testing data (x_test) and save it to the array predictions

In [26]:
predictions = LinearReg.predict(x_test)

#### Q4. Using the predictions and the y_test dataframe, calculate the value for each metric using the appropriate function


In [33]:
from sklearn.metrics import r2_score

# Mean absolute error and mean squared error of the residuals
LinearRegression_MAE = np.mean(np.absolute(y_test - predictions))
LinearRegression_MSE = np.mean((y_test - predictions)**2)
# Coefficient of determination
LinearRegression_R2  = r2_score(y_test, predictions)
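
The same three regression metrics are also available directly in `sklearn.metrics`. The sketch below, on made-up values, shows that the NumPy formulas and the library functions agree, and spells out the R2 definition.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and continuous predictions, invented for illustration only
y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.1, 0.6, 0.8, 0.9])

residuals = y_true - y_hat
mae_manual = np.mean(np.abs(residuals))
mse_manual = np.mean(residuals ** 2)
# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

mae = mean_absolute_error(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)
print(mae, mse, r2)
```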


#### Q5. Show the MAE, MSE, and R2 in a tabular format using data frame for the Linear model

In [73]:
data = {
    'MAE': [LinearRegression_MAE],
    'MSE': [LinearRegression_MSE],
    'R2': [LinearRegression_R2]
}
row_names = ['Linear Regression']
Report = pd.DataFrame(data, index = row_names)
print(Report)

                              MAE       MSE        R2
Linear Regression  (not recorded)  0.115721  0.427132


In [65]:

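# Scratch example: building a DataFrame with custom row names (not part of the weather analysis)
example_data = {
    'Age': [25, 30, 22],
    'Country': ['USA', 'Canada', 'UK']
}

# Define row names
row_names = ['Alice', 'Bob', 'Charlie']

# Create a DataFrame with the specified row names
example_df = pd.DataFrame(example_data, index=row_names)
print(example_df)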

         Age Country
Alice     25     USA
Bob       30  Canada
Charlie   22      UK


### Q6. KNN
- Create and train a KNN model called KNN using the training data (x_train, y_train), with the n_neighbors parameter set to 4

In [37]:
k = 4
KNN = KNeighborsClassifier(n_neighbors = k).fit(x_train, y_train)

### Q7. Prediction
- Use the predict method on the testing data x_test and save it to the array predictions

In [39]:
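# Note: KNN was fit on the unscaled x_train, while x_test is standardized here with a
# scaler fit on x_test itself. The model therefore sees test inputs on a different scale
# than its training data, which is consistent with the Jaccard index and F1-score of 0
# reported for KNN in the final table.
x_test_norm = preprocessing.StandardScaler().fit(x_test).transform(x_test.astype(float))
predictions = KNN.predict(x_test_norm)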
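
Distance-based models such as KNN generally need training and test inputs on the same scale. One common way to guarantee that, sketched here on synthetic data, is to put the scaler and the classifier in a single pipeline so the scaler is fit on the training split only. The data and the use of `make_pipeline` are assumptions for illustration, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature and one large-scale noise feature
rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)]),
    rng.normal(1000, 50, 100),
])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# fit() learns the scaling from X_train; predict() reuses it for X_test
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

scaler = knn.named_steps["standardscaler"]
print(scaler.mean_)
```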




### Q8.
- Use the predictions and the y_test dataframe to calculate the value for each metric using the appropriate function

In [44]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JacardIndex    = jaccard_score(y_test, predictions)
KNN_F1_Score       = f1_score(y_test, predictions)

### Q9. Decision Tree
- Create and train a Decision Tree model called Tree using the training data (x_train, y_train)

In [46]:
Tree = DecisionTreeClassifier(criterion = "entropy", max_depth = 4)
Tree.fit(x_train, y_train)
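
The `max_depth = 4` setting caps how many successive splits the tree may make. A minimal sketch on synthetic data (generated with `make_classification`, an assumption for illustration) shows how to confirm the fitted depth and leaf count.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, not the weather dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_train, y_train)

# A depth-4 binary tree can have at most 2**4 = 16 leaves
print(tree.get_depth(), tree.get_n_leaves())
tree_predictions = tree.predict(X_test)
```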

### Q10. 
- Use the predict method on the testing data x_test and save it to the array predictions

In [47]:
predictions = Tree.predict(x_test)

### Q11.
- Use the predictions and the y_test dataframe to calculate the value for each metric using the appropriate function.

In [48]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex   = jaccard_score(y_test, predictions)
Tree_F1_Score       = f1_score(y_test, predictions)

### Q12. Logistic Regression
- Use the train_test_split function to split the features and Y dataframes, with test_size set to 0.2 and random_state set to 1

In [49]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size = 0.2, random_state = 1)


### Q13.
- Create and train a Logistic Regression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear

In [51]:
LR = LogisticRegression(C = 0.01, solver = 'liblinear').fit(x_train, y_train)


### Q14.
- Use the predict and predict_proba methods on the testing data (x_test) and save the results as two arrays, predictions and predict_proba

In [52]:
predictions = LR.predict(x_test)

In [53]:
predict_proba = LR.predict_proba(x_test)
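
The two methods return related but different things: `predict` gives one class label per row, while `predict_proba` gives one probability per class per row, with columns ordered as in `classes_`. A small sketch on synthetic data (an assumption for illustration) makes the relationship concrete.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, not the weather dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

clf = LogisticRegression(C=0.01, solver="liblinear").fit(X, y)
labels = clf.predict(X)        # shape (n_samples,)
proba = clf.predict_proba(X)   # shape (n_samples, n_classes)

# Each row of probabilities sums to 1
row_sums = proba.sum(axis=1)

# The predicted label is the class with the highest probability
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
print(labels.shape, proba.shape)
```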

### Q15. 
- Use the predictions, predict_proba, and the y_test dataframe to calculate the value for each metric using the appropriate function.

In [55]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex   = jaccard_score(y_test, predictions)
LR_F1_Score       = f1_score(y_test, predictions)
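# log_loss expects predicted probabilities. Hard 0/1 labels are clipped to values just
# inside (0, 1), so every misclassified sample contributes a very large penalty.
# Probability-based alternative: log_loss(y_test, predict_proba)
LR_Log_Loss       = log_loss(y_test, predictions)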
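
Log loss is defined on predicted probabilities, so its value depends heavily on what is passed in. The sketch below uses four invented samples to compare the loss computed from probabilities with the loss computed from the corresponding hard labels.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and class probabilities, invented for illustration only
y_true = np.array([0, 0, 1, 1])
proba = np.array([
    [0.9, 0.1],
    [0.4, 0.6],   # wrong, but only mildly confident
    [0.2, 0.8],
    [0.1, 0.9],
])
hard = proba.argmax(axis=1)   # array([0, 1, 1, 1])

ll_proba = log_loss(y_true, proba)
ll_hard = log_loss(y_true, hard)

# Hand computation: average negative log of the probability given to the true class
ll_manual = -np.mean(np.log(proba[np.arange(4), y_true]))
print(ll_proba, ll_hard)
```

With hard labels, the single wrong prediction is treated as a near-certain mistake, so it dominates the average and the result is far larger than the probability-based loss.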

### Q16. SVM
- Create and train a SVM model called SVM using the training data (x_train, y_train)

In [58]:
SVM = svm.SVC(kernel='linear').fit(x_train, y_train)

### Q17
- Now use the predict method on the testing data (x_test) and save it to the array predictions

In [62]:
predictions = SVM.predict(x_test)

### Q18
- Use the predictions and the y_test dataframe to calculate the value for each metric using the appropriate function

In [63]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex   = jaccard_score(y_test, predictions)
SVM_F1_Score       = f1_score(y_test, predictions)

### Q19
- Show the Accuracy Score, Jaccard Index, F1-Score, and LogLoss in a tabular format using a dataframe for all of the above models

In [74]:
data = {
    'Accuracy Score': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JacardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1-score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    'LogLoss': ['', '', LR_Log_Loss, '']
}
row_names = ['KNN', 'Decision Tree', 'Logistic Regression', 'SVM']
Report = pd.DataFrame(data, index = row_names)
print(Report)

                     Accuracy Score  Jaccard Index  F1-score   LogLoss
KNN                        0.719084       0.000000  0.000000          
Decision Tree              0.818321       0.480349  0.648968          
Logistic Regression        0.827481       0.484018  0.652308  6.218218
SVM                        0.845802       0.534562  0.696697          
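
When a metric is not computed for every model, one option is to mark the missing cells with `np.nan` rather than empty strings, which keeps the column numeric. This is a sketch with placeholder model names and scores (all invented, not results from this project).

```python
import numpy as np
import pandas as pd

# Placeholder scores for illustration only; not results from this project
scores = {
    "Accuracy Score": [0.80, 0.85],
    "LogLoss": [np.nan, 0.40],   # np.nan marks a metric that was not computed
}
report = pd.DataFrame(scores, index=["Model A", "Model B"])

# The column stays float, so rounding and aggregation still work
print(report.round(3))
mean_logloss = report["LogLoss"].mean()   # NaN entries are skipped by default
```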
