# Model Training

## 1.1 Import Data and Required Packages

This section imports the libraries and packages used for data handling, modeling, and evaluation.

### Basic Imports

First, we import the libraries for data manipulation, modeling, and evaluation:


In [1]:
# Data manipulation
import numpy as np
import pandas as pd  # needed for pd.read_csv below

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Classification metrics for evaluating model performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Importing the CSV Data as a Pandas DataFrame
In this section, we will import our dataset from a CSV file into a Pandas DataFrame. This allows us to manipulate and analyze the data easily.


In [356]:
df = pd.read_csv('data/seattle-weather.csv')  # forward slash avoids the invalid '\s' escape sequence
df.head()


| | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
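A minimal sketch of the loading step, assuming the file has the columns shown above. To stay self-contained, it reads the five displayed rows from an in-memory string rather than from `data/seattle-weather.csv`. The `parse_dates` argument and the name `sample_df` are additions for illustration and are not part of the original cell.

```python
import io

import pandas as pd

# The five rows displayed above, used as a stand-in for data/seattle-weather.csv
csv_text = """date,precipitation,temp_max,temp_min,wind,weather
2012-01-01,0.0,12.8,5.0,4.7,drizzle
2012-01-02,10.9,10.6,2.8,4.5,rain
2012-01-03,0.8,11.7,7.2,2.3,rain
2012-01-04,20.3,12.2,5.6,4.7,rain
2012-01-05,1.3,8.9,2.8,6.1,rain
"""

# With a real file, pass a path with forward slashes (or a raw string) instead
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Inspect the inferred column types and the table size
print(sample_df.dtypes)
print(sample_df.shape)
```

Checking `dtypes` right after loading is a quick way to confirm that the numeric columns were not read as strings.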


## Dropping Unnecessary Columns from the DataFrame

In this section, we drop the columns that are not needed for our analysis. Here that is a single column: `date`.

In [357]:
# Drop the 'date' column from the DataFrame in place
df.drop(columns=['date'], inplace=True)


In [358]:
df.head()

| | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|
| 0 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 1.3 | 8.9 | 2.8 | 6.1 | rain |


## Preparing X and Y Variables

In this section, we prepare the feature set `X` and the target variable `y` for our machine learning model. Dropping the `weather` column from the DataFrame gives `X`, which contains the features used for prediction. The target `y` is the `weather` column that we want to predict.


In [359]:
X = df.drop(columns=['weather'])

In [360]:
X

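| | precipitation | temp_max | temp_min | wind |
|---|---|---|---|---|
| 0 | 0.0 | 12.8 | 5.0 | 4.7 |
| 1 | 10.9 | 10.6 | 2.8 | 4.5 |
| 2 | 0.8 | 11.7 | 7.2 | 2.3 |
| 3 | 20.3 | 12.2 | 5.6 | 4.7 |
| 4 | 1.3 | 8.9 | 2.8 | 6.1 |
| ... | ... | ... | ... | ... |
| 1456 | 8.6 | 4.4 | 1.7 | 2.9 |
| 1457 | 1.5 | 5.0 | 1.7 | 1.3 |
| 1458 | 0.0 | 7.2 | 0.6 | 2.6 |
| 1459 | 0.0 | 5.6 | -1.0 | 3.4 |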


In [361]:
y = df['weather']

In [362]:
y

0       drizzle
1          rain
2          rain
3          rain
4          rain
         ...   
1456       rain
1457       rain
1458        fog
1459        sun
1460        sun
Name: weather, Length: 1461, dtype: object
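Before modeling, it can help to check how the classes in the target are distributed, since an imbalanced target affects how accuracy should be read. The sketch below uses a small hypothetical label series (`sample_y`) for illustration, not the real 1461-row `y`.

```python
import pandas as pd

# Hypothetical labels for illustration only; the real y has 1461 entries
sample_y = pd.Series(["rain", "rain", "sun", "fog", "sun", "rain"], name="weather")

# Relative frequency of each class, most frequent first
class_share = sample_y.value_counts(normalize=True)
print(class_share)
```

On the real data, the same call would be `y.value_counts(normalize=True)`.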

## Creating a Column Transformer with Multiple Transformers

In this section, we will create a Column Transformer that applies different preprocessing techniques to numerical and categorical features in our dataset. This will help us prepare the data for modeling.

### Step 1: Identify Numerical and Categorical Features

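All four features in `X` (`precipitation`, `temp_max`, `temp_min`, and `wind`) are numerical, so no categorical encoding of the features is needed. Before fitting any model, we split the data into training and test sets.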

In [363]:
# Split the dataset into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((1168, 4), (293, 4))
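The split above samples rows at random. One possible variation, which the original notebook does not use, is a stratified split that keeps the class proportions the same in the training and test sets. The sketch below shows it on synthetic data; `X_demo`, `y_demo`, and the 80/20 class ratio are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 4 features, imbalanced labels (80 "rain", 20 "sun")
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array(["rain"] * 80 + ["sun"] * 20)

# stratify=y_demo keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print((y_te == "sun").sum())  # number of minority-class rows in the test set
```

Stratification matters most for rare classes, which a purely random split can leave with very few test examples.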

In [364]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# X_train, X_test, y_train, y_test come from the train/test split in the previous cell

# Define classification models
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
    "XGBClassifier": XGBClassifier(),
    "CatBoost Classifier": CatBoostClassifier(verbose=False),
    "AdaBoost Classifier": AdaBoostClassifier()
}

# Function to evaluate model performance
def evaluate_model(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    return accuracy, precision, recall, f1

# List to store model names and their test accuracies
model_list = []
accuracy_list = []

# Loop through models and train them
for name, model in models.items():
    model.fit(X_train, y_train)  # Train model
    
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Evaluate performance for the training set
    train_accuracy, train_precision, train_recall, train_f1 = evaluate_model(y_train, y_train_pred)
    
    # Evaluate performance for the test set
    test_accuracy, test_precision, test_recall, test_f1 = evaluate_model(y_test, y_test_pred)

    # Store the model name and test accuracy
    model_list.append(name)
    accuracy_list.append(test_accuracy)
    
    # Print model name
    print(name)
    
    # Print training set performance
    print('Model performance for Training set:')
    print(f"- Accuracy: {train_accuracy:.4f}")
    print(f"- Precision: {train_precision:.4f}")
    print(f"- Recall: {train_recall:.4f}")
    print(f"- F1-Score: {train_f1:.4f}")
    
    print('----------------------------------')
    
    # Print test set performance
    print('Model performance for Test set:')
    print(f"- Accuracy: {test_accuracy:.4f}")
    print(f"- Precision: {test_precision:.4f}")
    print(f"- Recall: {test_recall:.4f}")
    print(f"- F1-Score: {test_f1:.4f}")
    
    print('='*100)
    print('\n')

# Print the models and their test accuracies in a more readable format
print("Summary of Test Accuracies:")
for model_name, model_accuracy in zip(model_list, accuracy_list):
    print(f"{model_name}: Accuracy = {model_accuracy:.4f}")

# Visualize results
plt.figure(figsize=(12, 6))
plt.barh(model_list, accuracy_list, color='skyblue')
plt.xlabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.xlim(0, 1)
plt.show()


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression
Model performance for Training set:
- Accuracy: 0.8493
- Precision: 0.7739
- Recall: 0.8493
- F1-Score: 0.8034
----------------------------------
Model performance for Test set:
- Accuracy: 0.8328
- Precision: 0.7513
- Recall: 0.8328
- F1-Score: 0.7752






K-Neighbors Classifier
Model performance for Training set:
- Accuracy: 0.8134
- Precision: 0.8033
- Recall: 0.8134
- F1-Score: 0.8030
----------------------------------
Model performance for Test set:
- Accuracy: 0.7816
- Precision: 0.7380
- Recall: 0.7816
- F1-Score: 0.7526


Decision Tree Classifier
Model performance for Training set:
- Accuracy: 0.9974
- Precision: 0.9975
- Recall: 0.9974
- F1-Score: 0.9975
----------------------------------
Model performance for Test set:
- Accuracy: 0.7440
- Precision: 0.7445
- Recall: 0.7440
- F1-Score: 0.7438


Random Forest Classifier
Model performance for Training set:
- Accuracy: 0.9966
- Precision: 0.9966
- Recall: 0.9966
- F1-Score: 0.9966
----------------------------------
Model performance for Test set:
- Accuracy: 0.8089
- Precision: 0.7737
- Recall: 0.8089
- F1-Score: 0.7766




ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2 3 4], got ['drizzle' 'fog' 'rain' 'snow' 'sun']
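The `ValueError` above most likely comes from `XGBClassifier`, the next model in the loop, which expects integer class labels (`[0 1 2 3 4]`) rather than strings. A minimal sketch of one common remedy is to encode the labels with scikit-learn's `LabelEncoder`. The short label lists below are hypothetical stand-ins for `y_train` and `y_test`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for y_train and y_test
y_train_demo = ["drizzle", "rain", "sun", "fog", "snow", "rain"]
y_test_demo = ["sun", "rain", "fog"]

encoder = LabelEncoder()
# Fit on the training labels only, then reuse the same mapping for the test labels
y_train_enc = encoder.fit_transform(y_train_demo)
y_test_enc = encoder.transform(y_test_demo)

print(list(encoder.classes_))  # classes in sorted (alphabetical) order
print(y_train_enc.tolist())    # integer codes for the training labels
print(encoder.inverse_transform(y_test_enc).tolist())  # codes mapped back to names
```

Under this mapping the codes 0 to 4 stand for drizzle, fog, rain, snow, and sun. `inverse_transform` recovers the readable class names when reporting predictions.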

In [228]:
# Initialize the logistic regression model with a higher iteration limit
log_reg = LogisticRegression(fit_intercept=True, max_iter=200)

# Fit the model to the training data
log_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred = log_reg.predict(X_test)

# Calculate accuracy as a percentage
accuracy = accuracy_score(y_test, y_pred) * 100

# Print the accuracy
print("Accuracy of the model: %.2f%%" % accuracy)


Accuracy of the model: 100.00%
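The convergence warning printed earlier suggests either raising `max_iter`, as the cell above does, or scaling the features. A minimal sketch of the second option wraps a `StandardScaler` and the classifier in a pipeline. The data is synthetic, and the names `X_demo`, `y_demo`, and `scaled_logreg` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with features on very different scales (illustration only)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit()
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_logreg.fit(X_demo, y_demo)

# Accuracy on the data the pipeline was fitted on
print(scaled_logreg.score(X_demo, y_demo))
```

Keeping the scaler inside the pipeline means that calling `fit` on the training set alone never exposes test-set statistics to the model.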


In [229]:
import pandas as pd

# Collect the model names and test accuracies into a summary table
pd.DataFrame(list(zip(model_list, accuracy_list)), columns=['Model Name', 'Accuracy'])

| | Model Name | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 1.0 |
| 1 | K-Neighbors Classifier | 1.0 |
| 2 | Decision Tree Classifier | 1.0 |
| 3 | Random Forest Classifier | 1.0 |
| 4 | XGBClassifier | 1.0 |
| 5 | CatBoost Classifier | 1.0 |
| 6 | AdaBoost Classifier | 1.0 |
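To rank the models, a summary table like this can be sorted by accuracy. The sketch below is illustrative: it uses the test accuracies printed earlier in the per-model output for the four models that finished, and the names `model_names`, `test_accuracies`, and `results` are hypothetical.

```python
import pandas as pd

# Test accuracies copied from the per-model output printed earlier in this notebook
model_names = [
    "Logistic Regression",
    "K-Neighbors Classifier",
    "Decision Tree Classifier",
    "Random Forest Classifier",
]
test_accuracies = [0.8328, 0.7816, 0.7440, 0.8089]

# Build the summary table and order it from best to worst test accuracy
results = (
    pd.DataFrame({"Model Name": model_names, "Accuracy": test_accuracies})
    .sort_values(by="Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```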


In [230]:
pred_df = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': y_pred, 'Difference': y_test - y_pred})
pred_df

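| | Actual Value | Predicted Value | Difference |
|---|---|---|---|
| 892 | 4 | 4 | 0 |
| 1105 | 2 | 2 | 0 |
| 413 | 2 | 2 | 0 |
| 522 | 4 | 4 | 0 |
| 1036 | 2 | 2 | 0 |
| ... | ... | ... | ... |
| 1361 | 4 | 4 | 0 |
| 802 | 2 | 2 | 0 |
| 651 | 1 | 1 | 0 |
| 722 | 2 | 2 | 0 |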
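The labels here are class codes rather than quantities, so the size of their difference does not measure how wrong a prediction is. As an alternative sketch, a boolean match column and a confusion matrix show which classes are confused with which. The values below are hypothetical encoded labels, not the notebook's actual predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical encoded labels for illustration (not the notebook's actual predictions)
actual = pd.Series([4, 2, 2, 4, 1], name="Actual Value")
predicted = pd.Series([4, 2, 4, 4, 2], name="Predicted Value")

comparison = pd.DataFrame({"Actual Value": actual, "Predicted Value": predicted})
# True where the predicted class equals the actual class
comparison["Correct"] = comparison["Actual Value"] == comparison["Predicted Value"]

# Rows are actual classes, columns are predicted classes, in the order given by labels
matrix = confusion_matrix(actual, predicted, labels=[1, 2, 4])
print(comparison)
print(matrix)
```

Off-diagonal entries of the matrix count the misclassifications for each pair of actual and predicted classes.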
