
Assignment Title (Lab 3) : Predictive Modeling for Insurance Claims
NAME: BUSINGYE CAROLINE

REG. NO.: 2023/HD05/04657U

STUDENT NO.: 230004657

MASTER OF SCIENCE IN COMPUTER SCIENCE - (MCSC)
---



Objective: Build a predictive model to determine if a building will have an insurance claim during a specific period using building characteristics. In this assignment, you will explore and apply four machine learning algorithms: Support Vector Machine (SVM), Linear Regression, k-nearest Neighbors (KNN), and Naive Bayes. The evaluation metric for this assignment is the Area Under the Curve (AUC).

Variable Description
Customer Id Identification number for the Policy holder
YearOfObservation year of observation for the insured policy
Insured_Period duration of insurance policy in Olusola Insurance (e.g. full-year insurance, Policy Duration = 1; 6 months = 0.5)
Residential is the building a residential building or not
Building_Painted is the building painted or not (N-Painted, V-Not Painted)
Building_Fenced is the building fenced or not (N-Fenced, V-Not Fenced)
Garden building has a garden or not (V-has garden, O-no garden)
Settlement area where the building is located (R-rural area, U-urban area)
Building Dimension size of the insured building in m2
Building_Type The type of building (Type 1, 2, 3, 4)
Date_of_Occupancy date building was first occupied
NumberOfWindows number of windows in the building
Geo Code Geographical Code of the Insured building
Claim target variable. (0: no claim, 1: at least one claim over insured period).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

**importing important libraries**

In [None]:
# Importing important packages
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import classification_report
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from IPython.display import VimeoVideo
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.utils.validation import check_is_fitted
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score, auc
import numpy as np

**loading the data set**

In [None]:
dft = pd.read_csv('/content/drive/MyDrive/MachineLearning/ML Lab/dataset/train_data.csv')


**Displaying the first five rows of the dataset**

In [None]:

#viewing the first five rows of the dataset
dft.head()

In [None]:
#displaying the shape of the dataset
dft.shape

**The dataset has 7160 rows and 14 columns**

**Getting a quick overview of the dataset**

In [None]:

#checking data types
dft.info()

**This shows that, of the 14 columns, 3 contain floats, 4 contain integers and 7 contain strings, so the categorical variables need to be encoded**

**checking  for missing values  in the dataset**

In [None]:
dft.isnull().sum()

**The dataset has missing values in some columns, so they need to be treated to improve model performance**

**Let's see the total number of missing values**

In [None]:

dft.isnull().sum().sum()

**There are 723 missing values in total, and they have to be handled carefully**

**displaying basic statistics**

In [None]:
dft.describe()

In [None]:
#displaying basic key statistics
dft.describe(include=object)

In [None]:
dft.columns

In [None]:
dft.isna().sum()

The variable Garden has 7 missing values, Building Dimension has 106, Date_of_Occupancy has 508 and Geo_Code has 102.


In [None]:
# Handling missing values:
# I will use the mode for categorical columns and the median for numerical ones, since some variables are skewed
dft['Date_of_Occupancy']=dft['Date_of_Occupancy'].fillna(dft['Date_of_Occupancy'].median())
dft['Building Dimension']=dft['Building Dimension'].fillna(dft['Building Dimension'].median())
dft['Garden']=dft['Garden'].fillna(dft['Garden'].mode().iloc[0])
dft['Geo_Code'] = dft['Geo_Code'].fillna(dft['Geo_Code'].mode().iloc[0])
dft.info()

All the missing values have been handled, since every variable now has a non-null count of 7160.
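The imputation rule used above (median for skewed numeric columns, mode for categorical ones) can be illustrated on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the insurance data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "Building Dimension": [300.0, 450.0, np.nan, 20000.0],
    "Garden": ["V", "O", None, "O"],
})

# Median for the skewed numeric column: the extreme value 20000 does not drag it up
toy["Building Dimension"] = toy["Building Dimension"].fillna(toy["Building Dimension"].median())
# Mode (most frequent category) for the categorical column
toy["Garden"] = toy["Garden"].fillna(toy["Garden"].mode().iloc[0])

print(toy)
```

The missing dimension is filled with 450 (the median of 300, 450 and 20000) rather than with the mean, which the single large value would inflate.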

**Data distribution**

In [None]:
dft.columns

In [None]:
#displaying the statistics of Building dimension
dft['Building Dimension'].describe()

The Building Dimension column has outliers. For example, the max value is 20940.0 while the min value is 1.0. The mean is sensitive to outliers, yet it is still very small compared to the max value, which indicates that the max value is an outlier.

In [None]:
#displaying the statistics of date of occupancy
dft['Date_of_Occupancy'].describe()

In [None]:
import plotly.express as px

In [None]:
#create a box plot to visualize the outlier in the Building Dimension
fig = px.box(dft, y='Building Dimension')
fig.update_layout(height=400, width=500, title_text='Distribution of Building Dimension')
fig.show()

In [None]:
#create a box plot to visualize the outliers in Date_of_Occupancy
fig = px.box(dft, y='Date_of_Occupancy')
fig.update_layout(height=400, width=500, title_text='Distribution of Date_of_Occupancy')
fig.show()

The box plots show outliers in both variables: both distributions are skewed, and Building Dimension in particular has a long upper tail (its maximum lies far above the mean).

Dealing with outliers using the capping method

In [None]:
#To cap the outliers, calculate a upper limit and lower limit.
upper_limit = dft['Building Dimension'].mean() + 0.7*dft['Building Dimension'].std()
lower_limit = dft['Building Dimension'].mean() - 0.7*dft['Building Dimension'].std()
print('Upper_limit:',upper_limit)
print('Lower_limit:',lower_limit)

In [None]:
import numpy as np
#we use the numpy .where() function to apply the limits to Building Dimension.
dft['Building Dimension'] = np.where(dft['Building Dimension'] > upper_limit,upper_limit,
np.where(dft['Building Dimension'] < lower_limit,lower_limit,dft['Building Dimension']))

In [None]:
#displaying the new statistics of Building Dimension after capping the outliers
dft.describe()[['Building Dimension']]

The max value has now decreased to 3456.180514, the min value has increased to 287.565854, and the mean value is now 1490.370464. This shows that the outliers have been dealt with.
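For comparison, a common alternative to limits of the form mean ± k·std is to cap at the interquartile-range (Tukey) fences, which do not depend on the outliers themselves. This is a minimal sketch, not the method used in this notebook; the helper `cap_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up building sizes with one extreme value
sizes = pd.Series([1.0, 300.0, 500.0, 700.0, 900.0, 20940.0])
capped = cap_iqr(sizes)
print(capped.tolist())
```

Only the extreme value is pulled in to the upper fence; the remaining values, including the small one, stay as they are because they lie inside the fences.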

New visualisation of Building Dimension after capping the outliers

In [None]:
def plot_variable(dft, variable):
  """Plot a histogram and a boxplot of one numeric column side by side."""
  plt.figure(figsize = (16,4))
  # histogram
  plt.subplot(1,2,1)
  plt.hist(dft[variable], alpha = 0.5)
  plt.title(f'Histogram for the distribution of {variable}')
  plt.xlabel(variable)
  plt.ylabel('Frequency')
  # boxplot (horizontal, so the x-axis carries the variable's values)
  plt.subplot(1,2,2)
  sns.boxplot(x=dft[variable])
  plt.title(f'A boxplot for the distribution of {variable}')
  plt.xlabel(variable)
  plt.show()


In [None]:
plot_variable(dft,'Building Dimension')

**visualisation of outliers from the date of occupancy**

In [None]:

def plot_variable(dft,variable):
  plt.figure(figsize = (10,4))
  # histogram
  plt.subplot(1,2,1)
  plt.hist(dft[variable], alpha = 0.5)
  plt.title('Histogram for the distribution of date of occupancy')
  plt.xlabel('Date_of_Occupancy')
  plt.ylabel('Frequency')
  # boxplot
  plt.subplot(1,2,2)
  sns.boxplot(dft[variable])
  plt.xlabel('Date_of_Occupancy')
  plt.ylabel('Frequency')
  plt.title('A boxplot for the distribution of date of occupancy')
  plt.show()

In [None]:
plot_variable(dft,'Date_of_Occupancy')

dealing with outliers from the date of occupancy

In [None]:
# using the Z score method to deal with outliers in  the Date_of_Occupancy variable
upper_limit = dft['Date_of_Occupancy'].mean() + 2.5*dft['Date_of_Occupancy'].std()
lower_limit = dft['Date_of_Occupancy'].mean() - 2.5*dft['Date_of_Occupancy'].std()

In [None]:
print('Upper_limit:',upper_limit)
print('Lower_limit:',lower_limit)

In [None]:
# trimming the outliers
dfnew = dft.loc[(dft['Date_of_Occupancy']<upper_limit) & (dft['Date_of_Occupancy']>lower_limit)]
print('old dataframe:',len(dft))
print('new dataframe:',len(dfnew))
print('outliers:',len(dft)-len(dfnew))

Rows with outlying Date_of_Occupancy values have now been trimmed into dfnew; dft itself is unchanged.

visualisation of the date of occupancy column without outliers

In [None]:
def plot_variable(df,variable):
  plt.figure(figsize = (10,4))
  # histogram
  plt.subplot(1,2,1)
  plt.hist(df[variable], alpha = 0.5)
  plt.title('Histogram for the distribution of date of occupancy')
  plt.xlabel('Date_of_Occupancy')
  plt.ylabel('Frequency')
  # boxplot
  plt.subplot(1,2,2)
  sns.boxplot(df[variable])
  plt.xlabel('Date_of_Occupancy')
  plt.ylabel('Frequency')
  plt.title('A boxplot for the distribution of date of occupancy')
  plt.show()

In [None]:
plot_variable(dfnew,'Date_of_Occupancy')

In [None]:
dft.isnull().sum()

**The dataset now has  no missing values ,good to go**

**Distribution of the target variable Claim**

In [None]:
dft['Claim'].value_counts()

In [None]:
# visualising  distribution for target variable
plt.figure(figsize=(4,5))
claim_rate =dft["Claim"].value_counts()
sns.barplot(x=claim_rate.index,y=claim_rate.values,palette=["#1d7874","#8B0000"])
plt.title("insurance claim Counts",fontweight="black",size=15,pad=20)
for i, v in enumerate(claim_rate.values):
    plt.text(i, v, v,ha="center", fontweight='black', fontsize=18)


In [None]:
# Labels and counts for the two classes of the target variable
labels = ['Claimed.', 'No Claimed']
values = [dft["Claim"].sum(), len(dft) - dft["Claim"].sum()]
# Create a pie chart with labeled segments
plt.pie(values, autopct='%1.1f%%', startangle=140,explode=[0.3,0])
plt.title("Insurance claim Distribution")

plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is circular
plt.legend(labels=labels, loc='lower left')
plt.show()

**77.2% of customers did not make an insurance claim and 22.8% made at least one. The target variable is therefore imbalanced, with 0 (no claim) far more frequent than 1 (at least one claim over the insured period), so the classes may need to be balanced, for example by resampling or class weighting, before modelling.**
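One low-cost way to respect this imbalance is to stratify the train/test split so both parts keep the same class mix. The sketch below uses synthetic data with the 22.8% / 77.2% proportions reported above; `X_demo` and `y_demo` are made-up names, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))          # synthetic features
y_demo = np.array([1] * 228 + [0] * 772)     # same 22.8% / 77.2% mix as above

# stratify=y_demo keeps the claim rate (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())
```

Classifiers such as `SVC` also accept `class_weight='balanced'`, which reweights errors on the minority class instead of changing the data.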

**visualising the categorical values distribution**

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(10, 10))
sns.set_theme(style="darkgrid")
# Plot the Claim distribution
sns.countplot(data=dft, x='Claim', ax=axes[0, 0], palette='Set2')
axes[0, 0].set_title('Claim Distribution')

# Plot the Residential distribution
sns.countplot(data=dft, x='Residential', ax=axes[0, 1], palette='Set2')
axes[0, 1].set_title('Residential Distribution')


# Plot the Building Painted distribution
sns.countplot(data=dft, x='Building_Painted', ax=axes[0, 2], palette='Set2')
axes[0, 2].set_title('Building Painted Distribution')

# Plot the Building Fenced distribution
sns.countplot(data=dft, x='Building_Fenced', ax=axes[1, 0], palette='Set2')
axes[1, 0].set_title('Building Fenced Distribution')

#Plot the Garden distribution
sns.countplot(data=dft, x='Garden', ax=axes[1, 1],palette='Set2')
axes[1, 1].set_title('Garden Distribution')

# Plot the Settlement distribution
sns.countplot(data=dft, x='Settlement', ax=axes[1, 2],palette='Set2')
axes[1, 2].set_title('Settlement Distribution')

#Plot the Building_Type distribution
sns.countplot(data=dft, x='Building_Type', ax=axes[2, 0],palette='Set2')
axes[2, 0].set_title('Building_Type Distribution')


#Plot the Insured_Period distribution
#sns.countplot(data=df, x='Insured_Period', ax=axes[2, 1],palette='Set3')
#axes[2, 1].set_title('Insured_Period Distribution')

#Plot the Building_Type distribution
#sns.countplot(data=df, x='Building_Type', ax=axes[2, 1],palette='Set3')
#axes[2, 1].set_title('Building_Type Distribution')

fig.delaxes(axes[2, 1])
fig.delaxes(axes[2, 2])

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()


**Visualising the continuous variables**

In [None]:
numerical_cont=['Building Dimension','Date_of_Occupancy']

In [None]:
plt.figure(figsize=(8, 8))
for i, column in enumerate(numerical_cont, 1):
    plt.subplot(2, 2, i)
    sns.histplot(dft[column], bins=20, kde = True)
plt.tight_layout()
plt.show()

**Both Building Dimension and Date_of_Occupancy are skewed, which is consistent with the outliers seen in the box plots above**

**Distribution of categorical variables with respect to Claim (target variable)**

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
sns.set_theme(style="darkgrid")
# Plot the Claim distribution
sns.countplot(data=dft, x='Claim', ax=axes[0, 0], hue = 'Claim', palette='Set2')
axes[0, 0].set_title('Claim Distribution')

# Plot the Residential distribution
sns.countplot(data=dft, x='Residential', ax=axes[0, 1], hue = 'Claim',palette='Set2')
axes[0, 1].set_title('Residential Distribution')


# Plot the Building Painted distribution
sns.countplot(data=dft, x='Building_Painted', ax=axes[0, 2], hue = 'Claim',palette='Set2')
axes[0, 2].set_title('Building Painted Distribution')
# Plot the Building Fenced distribution
sns.countplot(data=dft, x='Building_Fenced', ax=axes[1, 0], hue = 'Claim',palette='Set2')
axes[1, 0].set_title('Building Fenced Distribution')

#Plot the Garden distribution
sns.countplot(data=dft, x='Garden', ax=axes[1, 1], hue = 'Claim',palette='Set2')
axes[1, 1].set_title('Garden Distribution')

# Plot the Settlement distribution
sns.countplot(data=dft, x='Settlement', ax=axes[1, 2], hue = 'Claim',palette='Set2')
axes[1, 2].set_title('Settlement Distribution')

#Plot the Building_Type distribution
sns.countplot(data=dft, x='Building_Type', ax=axes[2, 0], hue = 'Claim',palette='Set2')
axes[2, 0].set_title('Building_Type Distribution')


#Plot the Insured_Period distribution
#sns.countplot(data=df, x='Insured_Period', ax=axes[2, 1],palette='Set3')
#axes[2, 1].set_title('Insured_Period Distribution')


fig.delaxes(axes[2, 1])
fig.delaxes(axes[2, 2])

# Adjust the layout
plt.tight_layout()

# Show the plot
plt.show()


**Observations**


*   The non-painted buildings have a higher insurance claim count compared to the painted buildings
*   The non-fenced buildings have a higher insurance claim count compared to the fenced buildings
*   The buildings without gardens have a higher insurance claim count compared to those with gardens
*   Buildings in the urban settlement have a lower insurance claim count compared to those in the rural settlement

**Distribution of the target variable with respect to continuous variables**

In [None]:
dft.columns

In [None]:
figurecc=['Residential','Building Dimension','Building_Type','Date_of_Occupancy',]

In [None]:
list(enumerate(figurecc))

In [None]:

plt.figure(figsize=(10,8))
plt.suptitle("Distribution of Claim with respect to various numerical variables")
for i, column in enumerate(figurecc):
  plt.subplot(2,2,i+1)
  sns.histplot(x=column, hue ='Claim', data = dft, palette=['red','blue'])
  plt.xticks(rotation = 45)
plt.tight_layout()
plt.show()

**Observations**

There was a high number of insurance claims around the year 1960 according to the Date_of_Occupancy histogram.

Buildings with dimensions between 500 and 2000 have a higher number of insurance claims.

Non-residential buildings have a higher number of insurance claims compared to residential buildings.

Buildings of type 2 have a higher number of insurance claims than the other building types.

---



**Data encoding**

In [None]:
# Create a LabelEncoder
label_encoder = LabelEncoder()

# Encode categorical columns
categorical_columns = ["Building_Painted", "Building_Fenced", "Garden", "Settlement","NumberOfWindows",]
for column in categorical_columns:
    dft[column] = label_encoder.fit_transform(dft[column])
# print the encoded frame once, after all columns are encoded
print(dft)

In [None]:
#displaying the encoded dataset
dft.head()

**feature selection (selecting the best features for the model)**

In [None]:
dft.head()

In [None]:
dft.isna().sum()

In [None]:
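#Specification of independent and dependent variables of dataset 1
# keep numeric columns only: chi2 and f_classif cannot score string columns
# such as 'Customer Id' and 'Geo_Code'
x = dft.drop(columns=['Claim']).select_dtypes(include='number')  # Features
y = dft['Claim']  # Target variable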

In [None]:
#Feature selection using chi-squared statistics and ANOVA F-statistic
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Calculate chi-squared statistics for each feature
chi2_scores = chi2(x, y)[0]

# Calculate ANOVA F-statistic and p-values for each feature
f_scores = f_classif(x, y)[0]

# Combine chi-squared and ANOVA scores
combined_scores = chi2_scores + f_scores

feature_scores = pd.DataFrame({'Feature': x.columns, 'Combined_Score': combined_scores})
feature_scores = feature_scores.sort_values(by='Combined_Score', ascending=False)


In [None]:
feature_scores

In [None]:
sns.barplot(feature_scores, y ='Feature', x ='Combined_Score')
plt.xscale('log')

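From the graph, Building Dimension is the most important variable because it has the highest combined chi2 and F score. Note that chi2 scores grow with the magnitude of a feature, so large-valued columns such as Building Dimension are favoured by this ranking.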
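One caveat when reading a combined chi2 + F score is that the two statistics react differently to the unit of a feature. The sketch below, on synthetic data, rescales one feature by 1000 and compares the scores; `X_small`, `X_large` and `y_demo` are made-up names for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
# A non-negative feature that is weakly related to the target
feature = rng.uniform(1, 2, size=500) + 0.3 * y_demo
X_small = feature.reshape(-1, 1)
X_large = X_small * 1000  # same information, different unit

chi2_small = chi2(X_small, y_demo)[0][0]
chi2_large = chi2(X_large, y_demo)[0][0]
f_small = f_classif(X_small, y_demo)[0][0]
f_large = f_classif(X_large, y_demo)[0][0]

print("chi2 ratio:", chi2_large / chi2_small)  # grows with the rescaling
print("F ratio:", f_large / f_small)           # unaffected by the rescaling
```

Because the chi2 score scales with the feature's magnitude while the F score does not, ranking features by their sum tends to favour columns measured in large units; comparing the two scores separately, or on rescaled features, avoids that effect.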

**Splitting the dataset into training and testing sets**

In [None]:
#splitting the dataset
from sklearn.model_selection import train_test_split

# Assuming dft is your DataFrame
features = ['YearOfObservation', 'Insured_Period', 'Residential', 'Building_Painted', 'Building_Fenced', 'Garden', 'Settlement', 'Building Dimension', 'Building_Type', 'Date_of_Occupancy']
target = "Claim"

X = dft[features]
y = dft[target]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Scaling the data

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X_train

In [None]:
X_test

**Model Building**

**K-Nearest Neighbor(KNN) model**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Create the K-nearest Neighbours Classifier and train it on the scaled training set.
# X_train and X_test are reused from the scaling step above; re-splitting here
# would discard the scaling, which matters for a distance-based model.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# predicting on the test set
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Evaluating the K-nearest Neighbours model using classification report
print(classification_report(y_test,y_pred))

**Accuracy is the overall correct predictions divided by the total number of predictions. The overall accuracy is 0.73, meaning that the model correctly predicted the class for 73% of the instances.**
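Accuracy on its own can be misleading on an imbalanced target, which is one reason the assignment's metric is AUC. As a small illustration with made-up labels (77 negatives, 23 positives, roughly the class mix of this dataset), a model that always predicts "no claim" still looks accurate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 77 + [1] * 23)
always_zero = np.zeros_like(y_true)  # predicts "no claim" for everyone

print(accuracy_score(y_true, always_zero))  # 0.77
print(roc_auc_score(y_true, always_zero))   # 0.5
```

The constant predictor reaches 77% accuracy without separating the classes at all, and its AUC of 0.5 shows that it ranks claims no better than chance.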

Hyperparameter tuning for the KNN model

In [None]:
grid_params = { 'n_neighbors' : [2,6,10,12,20],
               'weights' : ['uniform','distance'],
               'metric' : ['minkowski','euclidean','manhattan']}

In [None]:
gs = GridSearchCV(KNeighborsClassifier(), grid_params, verbose = 1, cv=3, n_jobs = -1)

In [None]:
# fit the model on our train set
g_res = gs.fit(X_train, y_train)

In [None]:
# find the best score
g_res.best_score_

In [None]:
# get the hyperparameters with the best score
g_res.best_params_

In [None]:
# use the best hyperparameters found by the grid search
knn = KNeighborsClassifier(algorithm = 'brute', **g_res.best_params_)
knn.fit(X_train, y_train)

In [None]:
# get a prediction
y_hat = knn.predict(X_train)
y_knn = knn.predict(X_test)

**model  evaluation**

In [None]:
print('Test set accuracy: ', accuracy_score(y_test, y_knn))
knn_test_accuracy = accuracy_score(y_test, y_knn)  # avoid overwriting the scaler named sc

In [None]:
from sklearn.metrics import RocCurveDisplay, roc_curve, roc_auc_score
# Visualisation of the model's performance on an ROC/AUC curve
plt.figure(figsize=(5,4))
# use predicted probabilities of class 1, not hard labels, for ROC/AUC
y_pred_proba_knn = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_knn)
auc_score_knn = roc_auc_score(y_test, y_pred_proba_knn)
plt.plot(fpr, tpr, label=f'K-nearest Neighbours Classifier (AUC = {auc_score_knn:.3f})')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.legend(loc='lower right')
plt.title('ROC Curve for K-nearest Neighbours Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

**Naive Bayes**

In [None]:
# create the Naive Bayes' Classifier and use the train dataset to train the model
classifier=GaussianNB()
classifier.fit(X_train, y_train)

# predict the results of the model
y_predictnb=classifier.predict(X_test)
y_predict_proba = classifier.predict_proba(X_test)[:,1]

# Evaluating the model using classification report
print(classification_report(y_test,y_predictnb))

Overall accuracy is 0.76, meaning that the model correctly predicts the class for about 76% of the instances.

**Tuning hyperparameters for Naive Bayes**

In [None]:
np.logspace(0,-9, num=10)

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold

cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PowerTransformer


In [None]:
# Assuming X_train and y_train are your training data
Data_transformed = PowerTransformer().fit_transform(X_train)

# Create the Naive Bayes model
model_NB = GaussianNB()

# Define the parameter grid
params_NB = {'var_smoothing': np.logspace(0, -9, num=100)}

# Create the GridSearchCV object
gs_NB = GridSearchCV(estimator=model_NB,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')

# Fit the grid search to the data
gs_NB.fit(Data_transformed, y_train)

In [None]:
gs_NB.best_params_

In [None]:
gs_NB.best_score_

In [None]:
results_NB = pd.DataFrame(gs_NB.cv_results_['params'])
results_NB['test_score'] = gs_NB.cv_results_['mean_test_score']

In [None]:
plt.plot(results_NB['var_smoothing'], results_NB['test_score'], marker = '.')
plt.xlabel('Var. Smoothing')
plt.ylabel("Mean CV Score")
plt.title("NB Performance Comparison")
plt.show()

In [None]:
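# predict on the (power-transformed) training data that the grid search was fitted on
predict_train = gs_NB.predict(Data_transformed)

In [None]:
# Accuracy score on the training dataset (X_test was never power-transformed, so this is not a test score)
accuracy_train = accuracy_score(y_train,predict_train)
print('accuracy_score on training dataset : ', accuracy_train)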

In [None]:
# Visualisation of the models's performance on an ROC/AUC curve
plt.figure(figsize=(5,4))
y_predictnb_proba = classifier.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test,  y_predictnb_proba)
# AUC is computed from the predicted probabilities, not from hard class labels
nb_roc_auc2=roc_auc_score(y_test, y_predictnb_proba)
plt.plot(fpr, tpr, label='Naive Bayes Classifier')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.legend(loc='lower right')
plt.title('ROC Curve for Naive Bayes Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

**Building a linear regression model**

In [None]:
#creating a model
from sklearn.linear_model import LinearRegression
# creating a object
model = LinearRegression()
#training the model
model.fit(X, y)
#using the training dataset for the prediction
pred = model.predict(X)
#model performance
from sklearn.metrics import r2_score, mean_squared_error
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)
# Best fit line (left commented out)
# plt.scatter(x, y)
# plt.plot(X, pred, color = 'Black', marker = 'o')
#Results
print("Mean Squared Error : ", mse)
print("R-Squared :" , r2)
print("Y-intercept :"  , model.intercept_)
print("Slope :" , model.coef_)

The model's performance, as indicated by the R-squared value, is relatively low, suggesting that the linear regression model might not fully capture the underlying patterns in the data
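Since the target is binary and the assignment's metric is AUC, the real-valued outputs of a linear regression can also be treated as ranking scores and evaluated with `roc_auc_score`, alongside R-squared. This is a sketch on synthetic data under that assumption; `X_demo`, `y_demo` and `reg` are made-up names, not the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 2))
# Binary target driven by the first feature plus noise
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=400) > 0.8).astype(int)

# Fit on the first 300 rows, score the remaining 100
reg = LinearRegression().fit(X_demo[:300], y_demo[:300])
scores = reg.predict(X_demo[300:])  # real-valued, may fall outside [0, 1]

demo_auc = roc_auc_score(y_demo[300:], scores)
print(demo_auc)
```

AUC depends only on how the scores order the cases, so it remains meaningful even when the regression outputs are not valid probabilities.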

**tuning hyperparameters to boost the model**





In [None]:
# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'positive': [True, False]
}

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the GridSearchCV object
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

In [None]:
# Get the best parameters and best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

In [None]:
# Make predictions on the test set
y_pred = best_model.predict(X_test)

In [None]:
# Evaluate the model
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'R-squared (R2): {r2}')


**SVM Model building**

In [None]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)


In [None]:
# predict the results of the model
y_predict_svm=clf.predict(X_test)

In [None]:
# Evaluating the SVM model using classification report
print(classification_report(y_test,y_predict_svm))

The model performs well in predicting class 0 with high precision and recall. However, for class 1, the model has lower precision and recall, indicating challenges in correctly identifying instances of class 1. The low F1-score for class 1 suggests an imbalance between precision and recall. The overall accuracy is 77%, but it's crucial to consider the class-specific metrics, especially when dealing with imbalanced datasets.



In [None]:
# Evaluating the model's performance using a confusion matrix
from sklearn.metrics import confusion_matrix
cm_svm = confusion_matrix(y_test, y_predict_svm)
print(cm_svm)
accuracy_score(y_test, y_predict_svm)

In [None]:
accuracy=accuracy_score(y_test, y_predict_svm)
accuracy

In [None]:
# clf was created without probability=True, so use its decision values as scores
from sklearn.metrics import roc_curve, auc
y_score_svm = clf.decision_function(X_test)
# Compute the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y_test, y_score_svm)
roc_auc = auc(fpr, tpr)

In [None]:
# Visualisation of the model's performance on an ROC/AUC curve
from sklearn.metrics import roc_curve, roc_auc_score
svc_model = SVC(probability=True)  # named svc_model so the imported svm module is not shadowed

# Train the model
svc_model.fit(X_train, y_train)

# Visualize the ROC curve
plt.figure(figsize=(5, 4))
# Use decision_function to get decision values
y_decision = svc_model.decision_function(X_test)

# Min-max scale the decision values to [0, 1]; these are scores, not calibrated probabilities
y_pred_prob = (y_decision - y_decision.min()) / (y_decision.max() - y_decision.min())

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
svm_roc_auc = roc_auc_score(y_test, y_pred_prob)
plt.plot(fpr, tpr, label='Support Vector Classifier')
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.legend(loc='lower right')
plt.title('ROC Curve for Support Vector Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

**The ROC curve is relatively good since it lies some distance above the diagonal line**
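How far a curve sits from the diagonal is summarised by the AUC, and the value depends on whether it is computed from continuous scores or from thresholded labels. The tiny example below uses hypothetical scores to show the difference:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.90])  # hypothetical model scores
labels = (scores >= 0.5).astype(int)                       # thresholded 0/1 predictions

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)  # 0.875 0.75
```

Thresholding discards the ranking information among cases on the same side of the cut-off, so passing scores (probabilities or decision values) to `roc_auc_score` gives the AUC the metric is meant to measure.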

**loading the test dataset and cleaning it**

In [None]:
dftest = pd.read_csv('/content/drive/MyDrive/MachineLearning/ML Lab/dataset/test_data.csv')
dftest.head()

In [None]:
dftest.isnull().sum()

In [None]:
dftest['Geo_Code'] = dftest['Geo_Code'].fillna(dftest['Geo_Code'].mode()[0])

In [None]:
dftest['Garden']= dftest['Garden'].fillna(dftest['Garden'].mode()[0])

In [None]:
#Distribution of data is skewed hence we use median
dftest['Building Dimension'] = dftest['Building Dimension'].fillna(dftest['Building Dimension'].median())

In [None]:
dftest.drop(columns=['NumberOfWindows'],inplace=True)

In [None]:
dftest['Date_of_Occupancy'] = dftest['Date_of_Occupancy'].fillna(dftest['Date_of_Occupancy'].median())

In [None]:
dftest.drop(columns=['Customer Id'], inplace = True)

In [None]:
dftest.head()

In [None]:
# N -> 0, V -> 1 matches the alphabetical order LabelEncoder used on the training data
dftest['Building_Painted'] = dftest['Building_Painted'].replace({'N':0, 'V':1})

dftest['Building_Fenced']= dftest['Building_Fenced'].replace({'N':0, 'V':1})

dftest['Garden']= dftest['Garden'].replace({'O':0, 'V':1})

dftest['Settlement']= dftest['Settlement'].replace({'R':0, 'U':1})

dftest['Building_Type']= dftest['Building_Type'].replace({'1':0, '2':1, '3':2, '4':3})

In [None]:
dftest.head()

In [None]:
dftest.info()

In [None]:
# NumberOfWindows was dropped above and Geo_Code is dropped from X_test below,
# so only confirm the remaining columns here
dftest.columns

In [None]:
dftest.info()

In [None]:
X_test = dftest.copy()

In [None]:
X_test=X_test.drop(['Geo_Code'],axis=1)

In [None]:
X_test.head()

**model building**

In [None]:
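# updated_dft is not defined anywhere in this notebook; this assumes it was meant to be
# the cleaned, encoded training frame restricted to the model features
x = dft[features]  # Features
y = dft[target]  # Target variable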

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(x, y, test_size= 0.2, random_state =0)

In [None]:
pd.set_option('display.max_columns',None)
X_train.head()


In [None]:
X_train.shape

In [None]:
X_train['Building Dimension']

**Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
columns_to_standardize = ['YearOfObservation','Building Dimension', 'Date_of_Occupancy']
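X_train[columns_to_standardize] = scaler.fit_transform(X_train[columns_to_standardize])
# apply the same fitted scaler to the validation and test sets
X_val[columns_to_standardize] = scaler.transform(X_val[columns_to_standardize])
X_test[columns_to_standardize] = scaler.transform(X_test[columns_to_standardize])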

In [None]:
X_train.head()

**building svm**

A support-vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks like outliers detection. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
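The margin idea can be made concrete on a toy problem. For a linear SVM with weight vector w, the width of the margin between the two classes is 2/||w||. The four points below are made up, and a very large `C` is used as an approximation of a hard margin; this is an illustrative sketch rather than part of the insurance model.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters: class 0 at x=0, class 1 at x=3
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y_demo = np.array([0, 0, 1, 1])

svc_demo = SVC(kernel="linear", C=1e6).fit(X_demo, y_demo)  # large C approximates a hard margin
w = svc_demo.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print(margin_width)  # close to 3, the gap between the two clusters
```

The separating hyperplane lands midway between the clusters, and the margin spans the gap between them, which is the "largest distance to the nearest training-data point" described above.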

In [None]:

from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)



In [None]:
#Checking the default parameters in an SVC
clf.get_params()

In [None]:
#Predict the response for test dataset
y_pred = clf.predict(X_val)


In [None]:
#Calculating predictions, and accuracy score
pred_svc = clf.predict(X_val)
svm_accuracy = accuracy_score(y_val,pred_svc)  # not named svm, to keep the imported module usable
svm_accuracy

In [None]:

y_pred

In [None]:
#Building classification report
print(classification_report(y_val,pred_svc))

**printing the confusion matrix**

In [None]:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_val, pred_svc)

print('Confusion matrix\n\n', cm)

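# scikit-learn puts true labels on rows and predicted labels on columns,
# so with labels 0/1 the layout is [[TN, FP], [FN, TP]]
print('\nTrue Negatives(TN) = ', cm[0,0])

print('\nTrue Positives(TP) = ', cm[1,1])

print('\nFalse Positives(FP) = ', cm[0,1])

print('\nFalse Negatives(FN) = ', cm[1,0])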
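In scikit-learn's `confusion_matrix`, rows are the true labels and columns the predicted labels, so for a 0/1 target `ravel()` returns the cells in the order TN, FP, FN, TP. A quick check on made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 0]

# For binary labels 0/1 the flattened matrix is (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 3 1 1 1
```

Unpacking by name like this avoids indexing mistakes when reporting the four counts.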

**printing the ROC curve**

In [None]:
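from sklearn.metrics import roc_curve

# use continuous decision values so the curve has more than one threshold
svc_scores = clf.decision_function(X_val)
fpr, tpr, thresholds = roc_curve(y_val, svc_scores)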

plt.figure(figsize=(6,4))

plt.plot(fpr, tpr, linewidth=2)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12

plt.title('svm classifier')

plt.xlabel('False Positive Rate (1 - Specificity)')

plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# compute ROC AUC

from sklearn.metrics import roc_auc_score

# AUC from decision values rather than hard 0/1 predictions
ROC_AUC = roc_auc_score(y_val, clf.decision_function(X_val))

print('ROC AUC : {:.4f}'.format(ROC_AUC))

**Hyperparameter Tuning with GridSearchCV**

In [None]:
# Create a dictionary called param_grid and fill out some parameters for kernels, C and gamma
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}
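A parameter grid like this is typically passed to `GridSearchCV`. The sketch below shows one way to do that with AUC as the selection metric, on synthetic data and with a smaller grid so it runs quickly; `X_demo`, `y_demo` and `small_grid` are assumptions for illustration, not the notebook's data or final settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.77, 0.23], random_state=0
)

small_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
# scoring="roc_auc" ranks candidates by AUC, using the SVC decision values
search = GridSearchCV(SVC(), small_grid, scoring="roc_auc", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the model refitted on all the data passed to `fit` with the winning parameters, so it can be used directly for prediction.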


linear regression model

In [None]:
#Create a linear regression model
model = LinearRegression()

In [None]:
#Fit the model on the training data
model.fit(X_train, y_train)

In [None]:
# Step 5: Make predictions on the test set
y_pred = model.predict(X_val)

In [None]:

from sklearn.metrics import mean_squared_error, r2_score
# Step 6: Evaluate the model before hyperparameter tuning
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)
rmse = np.sqrt(mse)
print(f'Mean Squared Error (Before Hyperparameter Tuning): {mse}')
print(f'Root Mean Squared Error (Before Hyperparameter Tuning): {rmse}')
print(f'R-squared (R2): {r2}')

In [None]:
from sklearn.linear_model import LinearRegression
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)
# predict with the model that was just fitted
y_pred = linear_reg_model.predict(X_test)
y_pred

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNN_classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
KNN_classifier.fit(X_train, y_train)
y_pred = KNN_classifier.predict(X_test)
y_pred

In [None]:
from sklearn.naive_bayes import GaussianNB
NaiveBayes_classifier = GaussianNB()
NaiveBayes_classifier.fit(X_train, y_train)
y_pred = NaiveBayes_classifier.predict(X_test)
y_pred

In [None]:
# Example prediction using the trained models
svm_predictions = clf.predict(X_test)
linear_reg_predictions = linear_reg_model.predict(X_test)
kNN_predictions = KNN_classifier.predict(X_test)
nb_predictions = NaiveBayes_classifier.predict(X_test)