# **Intermediate: Machine Learning**

### Authors: [Precious Darkwa](http://linkedin.com/in/darkwa-precious)


### Content

1. [Supervised Learning](#part1)
  - [Linear Models](#part1.1)
  - [Problem Statement](#part1.2)
    * [Feature Descriptions](#part1.2.1)
    * [Save model](#part1.2.2)
    * [OOP Approach](#part1.2.3)
    * [Procedural Approach](#part1.2.4)
    * [Exercise](#part1.2.5)

<a name="part1"></a>


# **Supervised Learning**
Supervised learning is a branch of machine learning where the algorithm learns from labeled data, which means each input data point is associated with a corresponding target label. The goal of supervised learning is to learn a mapping from input variables to output variables.

In supervised learning, there are two main classes of models:

1. **Regression Models**: These models are used when the target variable is continuous. The goal of regression is to predict a continuous outcome. Examples of regression models include linear regression, polynomial regression, support vector regression (SVR), and ridge regression.

2. **Classification Models**: These models are used when the target variable is categorical, meaning it belongs to a specific class or category. The goal of classification is to predict the class label of new data points. Examples of classification models include logistic regression, decision trees, random forests, support vector machines (SVM), k-nearest neighbors (KNN), and naive Bayes.



<a name="part1.1"></a>


## **Linear Models**

1. **Linear Regression**:
   - Linear regression is a basic and widely used statistical technique for modeling the relationship between a dependent variable and one or more independent variables.
   - It assumes a linear relationship between the independent variables and the dependent variable.
   - The model equation has the form $y = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n + E$, where the $B_i$ are the coefficients, the $x_i$ are the independent variables, and $E$ is the error term.

   ```python
   from sklearn.linear_model import LinearRegression

   # Create a linear regression model
   model = LinearRegression()

   # Fit the model to the data
   model.fit(X_train, y_train)

   # Make predictions
   predictions = model.predict(X_test)
   ```

2. **Logistic Regression**:
   - Logistic regression is a linear model used for binary classification tasks.
   - It models the probability that a given instance belongs to a particular class.
   - Despite its name, it's a classification algorithm, not a regression algorithm.

   ```python
   from sklearn.linear_model import LogisticRegression

   # Create a logistic regression model
   model = LogisticRegression()

   # Fit the model to the data
   model.fit(X_train, y_train)

   # Make predictions
   predictions = model.predict(X_test)
   ```

3. **Support Vector Machine (SVM)**:
   - SVM is a powerful supervised learning algorithm used for both classification and regression tasks.
   - In classification, it finds the hyperplane that best separates the classes.
   - It can handle linear and non-linear data through the use of different kernel functions.

   ```python
   from sklearn.svm import SVC

   # Create an SVM classifier
   model = SVC(kernel='linear')

   # Fit the model to the data
   model.fit(X_train, y_train)

   # Make predictions
   predictions = model.predict(X_test)
   ```
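
The snippets above assume that `X_train`, `X_test`, `y_train`, and `y_test` already exist. As a minimal sketch on synthetic data (the coefficients 2.0, 3.0, and -1.5, the noise level, and the labeling rule are made up for illustration), the following fits a linear regression and compares the recovered coefficients with the model equation, then checks that the probability returned by a binary logistic regression is the sigmoid of its linear score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data: y = 2.0 + 3.0*x1 - 1.5*x2 + small noise (illustrative values)
X = rng.normal(size=(500, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

lin_model = LinearRegression().fit(X_train, y_train)
print("Intercept (B0):", lin_model.intercept_)
print("Coefficients (B1, B2):", lin_model.coef_)

# Synthetic binary labels derived from the same features (illustrative rule)
y_class = (X[:, 0] - X[:, 1] > 0).astype(int)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_class, test_size=0.3, random_state=42
)

log_model = LogisticRegression().fit(Xc_train, yc_train)

# For a binary problem, the class-1 probability is the sigmoid of the linear score
linear_score = log_model.decision_function(Xc_test)
sigmoid = 1.0 / (1.0 + np.exp(-linear_score))
probabilities = log_model.predict_proba(Xc_test)[:, 1]
print("Largest difference:", np.abs(sigmoid - probabilities).max())
```

The fitted intercept and coefficients should land close to the values used to generate the data, which is what "learning a mapping from inputs to outputs" means for a linear model.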



<a name="part1.2"></a>


## **Problem Statement**:
Building collapses have surged in Lagos and other major cities in Nigeria. In response, the task is to build a predictive model that estimates whether a building will have at least one insurance claim during its insured period, based on the building's characteristics. The target variable, "Claim", is binary:

1 if the building has at least one insurance claim during the insured period.

0 if the building does not file any insurance claims during the insured period.

<a name="part1.2.1"></a>


### **Description of each feature in the dataset**

1. **Building_Painted**: A binary variable indicating whether the building has been painted or not. In the data provided, "N" represents "Not painted" and "V" represents "Painted".

2. **Building_Fenced**: A binary variable indicating whether the building is fenced or not. In the data provided, "N" represents "Not fenced" and "V" represents "Fenced".

3. **Garden**: A binary variable indicating the presence or absence of a garden in or around the building. In the data provided, "O" represents "No garden" and "V" represents "Has a garden".

4. **Settlement**: A categorical variable indicating the type of settlement where the building is located. In the data provided, "U" represents "Urban" and "R" represents "Rural".

5. **NumberOfWindows**: The number of windows in the building. In the data provided, this column is stored as text: most values are window counts, but it also contains special values such as "." and ">=10", which the cleaning code handles before modeling.

6. **Customer Id**: An identifier assigned to each customer. It uniquely identifies each entry in the dataset.

7. **YearOfObservation**: The year in which the observation or data collection was made. It provides temporal context for the recorded data.

8. **Insured_Period**: The duration for which the building is insured, typically represented in years. It indicates the length of time the insurance policy covers.

9. **Residential**: A binary variable indicating whether the building is used for residential purposes or not. In the data provided, "0" represents "Not residential" and "1" represents "Residential".

10. **Building Dimension**: The physical size or dimensions of the building, typically measured in square meters or a similar unit. It provides quantitative information about the building's size.

11. **Building_Type**: A categorical variable indicating the type or classification of the building. In the data provided, numerical values represent different building types.

12. **Date_of_Occupancy**: The year in which the building was occupied or became operational. It provides information on the age of the building.

13. **Geo_Code**: A code representing the geographical location or region where the building is situated. It helps identify the geographic distribution of buildings.

14. **Claim**: The target variable indicating whether the building has filed an insurance claim (1) or not (0) during the insured period. It serves as the outcome variable for predictive modeling.
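
The single-letter codes above are easy to misread during exploration. As a small sketch, the lookup below turns them back into readable labels. The three sample rows are hypothetical and are not taken from the dataset; only the code-to-label pairs come from the descriptions above.

```python
import pandas as pd

# Tiny hypothetical sample using the codes described above (not real dataset rows)
sample = pd.DataFrame({
    "Building_Painted": ["N", "V", "V"],
    "Building_Fenced": ["V", "N", "V"],
    "Garden": ["O", "V", "O"],
    "Settlement": ["U", "R", "U"],
})

# Code-to-label lookups taken from the feature descriptions
code_labels = {
    "Building_Painted": {"N": "Not painted", "V": "Painted"},
    "Building_Fenced": {"N": "Not fenced", "V": "Fenced"},
    "Garden": {"O": "No garden", "V": "Has a garden"},
    "Settlement": {"U": "Urban", "R": "Rural"},
}

# Map every coded column to its readable label
decoded = sample.copy()
for column, mapping in code_labels.items():
    decoded[column] = decoded[column].map(mapping)

print(decoded)
```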

In [None]:
from google.colab import drive
drive.mount('/content/drive')
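
The `google.colab` module is only importable inside Google Colab, so the mount cell raises `ModuleNotFoundError` anywhere else. One possible workaround, sketched below, mounts Drive only when Colab is detected. The helper name `resolve_data_dir` and the local folder `data` are assumptions made for illustration.

```python
from pathlib import Path


def resolve_data_dir(local_dir="data"):
    """Return the folder holding the CSV files, mounting Google Drive only inside Colab."""
    try:
        from google.colab import drive  # only importable inside Google Colab
    except ImportError:
        # Not running in Colab: fall back to a local folder (hypothetical default)
        return Path(local_dir)
    drive.mount('/content/drive')
    return Path('/content/drive/My Drive/Insurance Prediction')


data_dir = resolve_data_dir()
print(data_dir / 'train_data.csv')
```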

In [None]:
file_path = '/content/drive/My Drive/Insurance Prediction/train_data.csv'
file_path_test = '/content/drive/My Drive/Insurance Prediction/test_data.csv'


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import pickle
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')


In [None]:
df = pd.read_csv(file_path)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
pd.unique(df['NumberOfWindows'])

In [None]:
pd.unique(df['Geo_Code'])

In [None]:
pd.unique(df['Building_Painted'])

In [None]:
pd.unique(df['Building_Fenced'])

In [None]:
pd.unique(df['Settlement'])

In [None]:
pd.unique(df['Garden'])

In [None]:
# Strip spaces and handle special cases in the 'NumberOfWindows' column
df['NumberOfWindows'] = df['NumberOfWindows'].str.strip().replace({
    '.': np.nan,
    '>=10': 10
}).astype(float)
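
To see what this cleaning step does, here is a sketch on a handful of hypothetical raw values. It uses a slightly different route from the cell above: `'>=10'` is replaced by the text `'10'` and `pd.to_numeric` performs the conversion, turning anything non-numeric (such as `'.'`) into `NaN`.

```python
import pandas as pd

# Hypothetical raw values: stray spaces, '.' for missing, and the '>=10' bucket
raw = pd.Series(['   .', '4', '>=10', ' 3 '])

cleaned = pd.to_numeric(
    raw.str.strip().replace({'>=10': '10'}),  # keep everything as text until conversion
    errors='coerce',                          # non-numeric text such as '.' becomes NaN
)

print(cleaned.tolist())
```

Note that treating ">=10" as exactly 10 loses information about larger counts; that is a modeling choice, not a property of the data.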


In [None]:
numerical_columns = df.select_dtypes(include=['float64','int64']).columns
categorical_columns = df.select_dtypes(include=['object']).columns

In [None]:
numerical_impute = SimpleImputer(strategy='median')


In [None]:
# Fill missing numerical values with each column's median
for col in numerical_columns:
  df[col] = numerical_impute.fit_transform(df[[col]]).ravel()

In [None]:
for col in categorical_columns:
  df[col] = df[col].fillna(df[col].mode()[0])
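
The cells above learn the fill values from the whole dataset. A common alternative is to learn them from the training rows only and reuse them on new rows, so that nothing from the held-out data leaks into preprocessing. The sketch below illustrates this on two hypothetical frames; the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training rows and new rows, both with gaps
train = pd.DataFrame({
    "Building Dimension": [100.0, np.nan, 300.0, 5000.0],
    "Garden": ["V", "O", np.nan, "V"],
})
new = pd.DataFrame({
    "Building Dimension": [np.nan, 250.0],
    "Garden": [np.nan, "O"],
})

# Numerical column: learn the median from the training rows only
median_imputer = SimpleImputer(strategy="median")
train["Building Dimension"] = median_imputer.fit_transform(
    train[["Building Dimension"]]
).ravel()
new["Building Dimension"] = median_imputer.transform(
    new[["Building Dimension"]]
).ravel()

# Categorical column: fill with the most frequent training value
garden_mode = train["Garden"].mode()[0]
train["Garden"] = train["Garden"].fillna(garden_mode)
new["Garden"] = new["Garden"].fillna(garden_mode)

print("Learned median:", median_imputer.statistics_[0])
print(new)
```

The median is used rather than the mean because a single extreme value (5000.0 here) barely moves it.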

In [None]:
label = LabelEncoder()

In [None]:
for col in categorical_columns:
  df[col] = label.fit_transform(df[col])
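
`LabelEncoder` sorts the distinct values of a column and replaces each one with its position in that sorted list. Because the loop above reuses a single encoder, only the mapping for the last column remains available afterwards. The sketch below, on hypothetical values, keeps one encoder per column so that every mapping can be inspected or reversed.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values
columns = {
    "Building_Painted": ["V", "N", "V", "V"],
    "Settlement": ["U", "R", "R", "U"],
}

encoders = {}
encoded = {}
for name, values in columns.items():
    le = LabelEncoder()
    encoded[name] = le.fit_transform(values)  # integer codes, in sorted-class order
    encoders[name] = le                       # keep the fitted encoder for later

for name, le in encoders.items():
    print(name, "classes:", list(le.classes_), "codes:", list(encoded[name]))

# A stored encoder can translate codes back into the original labels
restored = encoders["Settlement"].inverse_transform(encoded["Settlement"])
print(list(restored))
```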

In [None]:
df.info()

In [None]:
# Define features and target variable
X = df.drop(columns=['Customer Id', 'Claim'])
y = df['Claim']


In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
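
When one class is much rarer than the other, a purely random split can leave the two sets with different class proportions. Passing `stratify=y` to `train_test_split` keeps the proportions aligned. The sketch below uses synthetic labels with an assumed 20% positive rate, which is not a claim about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20 positives and 80 negatives (assumed ratio)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Train shape:", Xd_train.shape, "positives:", yd_train.sum())
print("Test shape:", Xd_test.shape, "positives:", yd_test.sum())
```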


In [None]:
# Build and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [None]:
# Evaluate the model
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

In [None]:
# Compute evaluation metrics from y_test, y_pred, and y_pred_prob
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", conf_matrix)
print("ROC AUC Score:", round(roc_auc, 2))
# Scale to a percentage before rounding to avoid floating-point artifacts
print("Accuracy: " + str(round(accuracy * 100, 2)) + " %")
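
To make these three metrics concrete, here is a worked example on four hand-picked labels and probabilities (illustrative values only). Accuracy and the confusion matrix depend on the 0.5 threshold used to turn probabilities into labels, whereas ROC AUC is computed from the probabilities themselves and measures how often a positive case is ranked above a negative one.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Four illustrative cases: true labels and predicted class-1 probabilities
y_true_demo = [0, 0, 1, 1]
y_prob_demo = [0.1, 0.4, 0.35, 0.8]

# Threshold the probabilities at 0.5 to get hard predictions
y_hat_demo = [int(p >= 0.5) for p in y_prob_demo]

# Rows are actual classes, columns are predicted classes
conf_demo = confusion_matrix(y_true_demo, y_hat_demo)
acc_demo = accuracy_score(y_true_demo, y_hat_demo)
auc_demo = roc_auc_score(y_true_demo, y_prob_demo)

print(conf_demo)                         # [[2 0], [1 1]]
print("Accuracy:", round(acc_demo * 100, 2), "%")
print("ROC AUC:", round(auc_demo, 2))    # 3 of the 4 positive/negative pairs are ranked correctly
```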

In [None]:
# Print the classification report
classification_report_output = classification_report(y_test, y_pred)
print(classification_report_output)
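
The classification report lists precision, recall, and F1 for each class. As a sketch of where those numbers come from, the example below derives them for class 1 from the confusion-matrix counts on six illustrative predictions and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Six illustrative cases
y_true_small = [0, 0, 1, 1, 1, 0]
y_pred_small = [0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_small, y_pred_small).ravel()

precision = tp / (tp + fp)  # share of predicted claims that were real claims
recall = tp / (tp + fn)     # share of real claims that the model found
f1 = 2 * precision * recall / (precision + recall)

print("By hand:", precision, recall, f1)
print("sklearn:",
      precision_score(y_true_small, y_pred_small),
      recall_score(y_true_small, y_pred_small),
      f1_score(y_true_small, y_pred_small))
```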

In [None]:
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


<a name="part1.2.2"></a>

### **Saving the Model using Pickle**

In [None]:
# Save the model to a file
pickle_file = 'logistic_regression_model.pkl'
with open(pickle_file, 'wb') as file:
    pickle.dump(model, file)

# Load the model from the file
with open(pickle_file, 'rb') as file:
    loaded_model = pickle.load(file)


# Evaluate the loaded model
y_pred = loaded_model.predict(X_test)
y_pred_prob = loaded_model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_pred_prob)
print("ROC AUC Score:", round(roc_auc, 2))

# Print the classification report and accuracy
classification_report_output = classification_report(y_test, y_pred)
print(classification_report_output)
accuracy = round(accuracy_score(y_test, y_pred), 2)
print("Accuracy:", accuracy)
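
A quick way to confirm that a saved model is intact is to check that the restored copy gives the same predictions as the original. The sketch below does the round trip in memory with `pickle.dumps` and `pickle.loads` on a synthetic model; the data and labeling rule are invented for illustration.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data (illustrative only)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)

original = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Serialize to bytes and restore, without touching the disk
payload = pickle.dumps(original)
restored_model = pickle.loads(payload)

# The restored model should behave exactly like the original
same_predictions = np.array_equal(original.predict(X_syn), restored_model.predict(X_syn))
print("Identical predictions:", same_predictions)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust. It is also safest to load a model with the same scikit-learn version that saved it.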



<a name="part1.2.2"></a>

### **Saving the Model using Joblib**

In [None]:
# Save the model to a file
joblib_file = 'logistic_regression_model.joblib'
joblib.dump(model,joblib_file)


In [None]:
# Load the model from the file
loaded_model = joblib.load(joblib_file)
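
The same round-trip check works with joblib. The sketch below writes to a temporary directory so that it leaves no files behind, and uses the optional `compress` argument of `joblib.dump`; the file name and the synthetic data are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data (illustrative only)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 3))
y_syn = (X_syn[:, 0] - X_syn[:, 2] > 0).astype(int)

fitted = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')  # hypothetical file name
    joblib.dump(fitted, path, compress=3)         # compress trades speed for a smaller file
    file_was_written = os.path.exists(path)
    reloaded = joblib.load(path)

matches = np.array_equal(fitted.predict(X_syn), reloaded.predict(X_syn))
print("File written:", file_was_written, "| identical predictions:", matches)
```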


<a name="part1.2.3"></a>


### **OOP APPROACH FOR THE MODEL CREATION**

In [None]:
class MachineLearningPipeline:
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None
        self.label_encoders = None

    def load_data(self, data_path):
        # Load the dataset
        return pd.read_csv(data_path)

    def preprocess_data(self, data, training=True):
        # Identify numerical and categorical columns
        numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
        categorical_columns = data.select_dtypes(include=['object']).columns

        # Impute missing values for numerical columns
        numerical_imputer = SimpleImputer(strategy='median')
        for col in numerical_columns:
            data[col] = numerical_imputer.fit_transform(data[[col]])

        # Impute missing values for categorical columns
        for col in categorical_columns:
            data[col] = data[col].fillna(data[col].mode()[0])

        # Convert categorical variables to numerical using LabelEncoder
        self.label_encoders = {}
        for col in categorical_columns:
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])
            self.label_encoders[col] = le

        if training:
            # Split the data into features and target
            X = data.drop(['Customer Id', 'Claim'], axis=1)
            y = data['Claim']
            return X, y
        else:
            # For new data prediction
            X_new = data.drop(['Customer Id'], axis=1)
            return X_new

    def train_model(self, X, y):
        # Train a logistic regression model
        self.model = LogisticRegression(max_iter=1000)
        self.model.fit(X, y)

    def save_model(self):
        # Save the model to a file
        with open(self.model_path, 'wb') as file:
            pickle.dump(self.model, file)

    def evaluate_model(self, X, y):
        # Predict and evaluate the model
        y_pred = self.model.predict(X)
        y_pred_prob = self.model.predict_proba(X)[:, 1]

        conf_matrix = confusion_matrix(y, y_pred)
        roc_auc = round(roc_auc_score(y, y_pred_prob),2)
        classification_report_output = classification_report(y, y_pred)
        accuracy = round(accuracy_score(y, y_pred) * 100, 2)

        print("Confusion Matrix:")
        print(conf_matrix)
        print("\nROC AUC Score:")
        print(roc_auc)
        print("\nClassification Report:")
        print(classification_report_output)
        print(f"\nAccuracy: {accuracy} %")

    def load_model(self):
        # Load the model from a file
        with open(self.model_path, 'rb') as file:
            self.model = pickle.load(file)

# Usage
pipeline = MachineLearningPipeline('logistic_regression_model.pkl')
data = pipeline.load_data(file_path)
X, y = pipeline.preprocess_data(data, training=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train model
pipeline.train_model(X_train, y_train)
pipeline.save_model()

# Evaluate model
pipeline.evaluate_model(X_test, y_test)

# Load model
pipeline.load_model()
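
One limitation of the class above is that `preprocess_data` fits fresh encoders on every call, so new data may be encoded differently from the training data. A possible remedy is to fit the encoder once on the training data and only call `transform` afterwards. The sketch below swaps `LabelEncoder` for scikit-learn's `OrdinalEncoder`, which handles several columns at once and can map unseen categories to a placeholder; the frames and the placeholder value -1 are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training rows and new rows
train_cats = pd.DataFrame({"Settlement": ["U", "R", "U"], "Garden": ["V", "O", "V"]})
new_cats = pd.DataFrame({"Settlement": ["R", "U"], "Garden": ["V", "X"]})  # 'X' was never seen

# Fit once on the training data; unseen categories are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cats)

# Reuse the fitted mapping on new data instead of refitting
new_encoded = encoder.transform(new_cats)
print(new_encoded)
```

To use this at prediction time, the fitted encoder would need to be saved alongside the model, for example in the same pickle file.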



<a name="part1.2.4"></a>

### **PROCEDURAL APPROACH FOR MODEL CREATION**

In [None]:
def load_data(data_path):
    # Load the dataset
    return pd.read_csv(data_path)

def preprocess_data(data):
    # Identify numerical and categorical columns
    numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
    categorical_columns = data.select_dtypes(include=['object']).columns

    # Impute missing values for numerical columns
    numerical_imputer = SimpleImputer(strategy='median')
    for col in numerical_columns:
        data[col] = numerical_imputer.fit_transform(data[[col]])

    # Impute missing values for categorical columns
    for col in categorical_columns:
        data[col] = data[col].fillna(data[col].mode()[0])

    # Convert categorical variables to numerical using LabelEncoder
    label_encoders = {}
    for col in categorical_columns:
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
        label_encoders[col] = le

    return data, label_encoders

def train_model(X, y):
    # Train a logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model

def save_model(model, model_path):
    # Save the model to a file
    with open(model_path, 'wb') as file:
        pickle.dump(model, file)

def evaluate_model(model, X, y):
    # Predict and evaluate the model
    y_pred = model.predict(X)
    y_pred_prob = model.predict_proba(X)[:, 1]

    conf_matrix = confusion_matrix(y, y_pred)
    roc_auc = round(roc_auc_score(y, y_pred_prob), 2)
    classification_report_output = classification_report(y, y_pred)
    accuracy = round(accuracy_score(y, y_pred) * 100, 2)

    print("Confusion Matrix:")
    print(conf_matrix)
    print("\nROC AUC Score:")
    print(roc_auc)
    print("\nClassification Report:")
    print(classification_report_output)
    print(f"\nAccuracy: {accuracy} %")

def load_model(model_path):
    # Load the model from a file
    with open(model_path, 'rb') as file:
        return pickle.load(file)

# Usage
data_path = file_path
model_path = 'logistic_regression_model.pkl'

# Load and preprocess data
data = load_data(data_path)
data, _ = preprocess_data(data)

# Split data into features and target
X = data.drop(['Customer Id', 'Claim'], axis=1)
y = data['Claim']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train model
model = train_model(X_train, y_train)
# Save model
save_model(model, model_path)

# Evaluate model
evaluate_model(model, X_test, y_test)




In [None]:
# Load model
loaded_model = load_model(model_path)

# Evaluate loaded model
evaluate_model(loaded_model, X_test, y_test)
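
The functions above save only the model, so the imputation and encoding steps have to be repeated by hand on new data. An alternative, sketched below and not the approach used in this notebook, is to wrap preprocessing and model in a single scikit-learn `Pipeline` so that one pickled object carries everything. The synthetic frame, the two feature columns, and the labeling rule are assumptions for illustration.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the training data (not the real dataset)
rng = np.random.default_rng(0)
n_rows = 200
frame = pd.DataFrame({
    "Building Dimension": rng.uniform(50, 500, size=n_rows),
    "Settlement": rng.choice(["U", "R"], size=n_rows),
})
frame["Claim"] = (frame["Building Dimension"] > 250).astype(int)  # made-up rule
frame.loc[::25, "Building Dimension"] = np.nan  # inject some missing values
frame.loc[::40, "Settlement"] = np.nan

numeric_features = ["Building Dimension"]
categorical_features = ["Settlement"]

# Median imputation for numbers; mode imputation plus integer encoding for categories
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), categorical_features),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_frame = frame[numeric_features + categorical_features]
clf.fit(X_frame, frame["Claim"])

# One object now holds the fitted imputers, encoder, and model
restored_clf = pickle.loads(pickle.dumps(clf))
pipeline_predictions = restored_clf.predict(X_frame)
print(pipeline_predictions[:10])
```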

### **PROCEDURAL APPROACH FOR THE TEST DATA**

In [None]:
def load_model(model_path):
    # Load the model from a file
    with open(model_path, 'rb') as file:
        return pickle.load(file)

def preprocess_data(data):
    # Identify numerical and categorical columns
    numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
    categorical_columns = data.select_dtypes(include=['object']).columns

    # Impute missing values for numerical columns
    numerical_imputer = SimpleImputer(strategy='median')
    for col in numerical_columns:
        data[col] = numerical_imputer.fit_transform(data[[col]])

    # Impute missing values for categorical columns
    for col in categorical_columns:
        data[col] = data[col].fillna(data[col].mode()[0])

    # Convert categorical variables to numerical using LabelEncoder
    label_encoders = {}
    for col in categorical_columns:
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
        label_encoders[col] = le

    return data, label_encoders

def save_predictions(predictions, output_path):
    # Save predictions to an excel file
    predictions.to_excel(output_path, index=False)

def predict(data_path, model_path, output_path):
    # Load data
    data = pd.read_csv(data_path)

    # Keep the original customer IDs before preprocessing label-encodes them
    customer_id = data['Customer Id'].copy()

    # Preprocess data
    data, _ = preprocess_data(data)

    # Load model
    model = load_model(model_path)

    # Make predictions
    X = data.drop('Customer Id', axis=1)  # Assuming 'Customer Id' is present in data
    predictions = model.predict(X)

    # Create DataFrame with the original customer IDs and predictions
    predictions_df = pd.DataFrame({'Customer Id': customer_id, 'Predictions': predictions})

    # Save predictions to excel
    save_predictions(predictions_df, output_path)

# Usage
data_path = file_path_test
model_path = 'logistic_regression_model.pkl'
output_path = 'predictions.xlsx'
predict(data_path, model_path, output_path)


<a name="part1.2.5"></a>

### **EXERCISE**

### **OOP APPROACH FOR THE TEST DATA**

Write the code for the object-oriented programming (OOP) approach for the test data, mirroring the procedural version above.

## Congrats! That's it for this tutorial.


### Author(s):
Precious Darkwa, Data Science/Analytics Instructor @ Blossom Academy

Email: preciousdarkwa@gmail.com

---

*This notebook was originally created by Ghana Data Science Summit for the [IndabaX Ghana](https://www.indabaxghana.com/) 2024 Conference and is published under [MIT license](https://choosealicense.com/licenses/mit/).*