**Intro to SVM**
- Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks.
- It is most naturally suited to binary classification problems.

**Objective:**                                          
- SVM aims to find the hyperplane that best separates the different classes of data.
- The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points (the support vectors).

**Linear SVM**
* In linear SVM, the hyperplane is a straight line (in 2D), a plane (in 3D), or a hyperplane in higher dimensions.
* It is ideal for linearly separable data.
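The margin described above can be read off a fitted linear model. The following is a minimal sketch on made-up 2D points (not the notebook's dataset); it assumes that a very large `C` approximates a hard-margin SVM and uses the standard relation margin width = 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D points (made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Assumption: a very large C behaves approximately like a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating line
b = clf.intercept_[0]   # offset of the separating line
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(clf.support_vectors_)
```

The support vectors printed at the end are the training points that lie on the margin boundaries; the other points do not influence the fitted line.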

**Non-linearly separable data**
* When the data is not linearly separable, SVM can use a kernel function (e.g., polynomial, radial basis function) to transform the data into a higher-dimensional space where separation is possible.
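As a rough illustration of this point, the sketch below compares a linear and an RBF kernel on synthetic concentric circles generated with `make_circles`. The data and parameter values are assumptions chosen for the demo and have nothing to do with the dataset used later in this notebook.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel:>6}: test accuracy = {scores[kernel]:.3f}")
```

On data like this the RBF kernel should score far higher than the linear one, because the rings only become separable after the non-linear transformation.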

**C parameter**
- SVM has a regularization parameter 'C' that controls the trade-off between maximizing the margin and minimizing the classification error.
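- Small 'C': larger margin, some misclassification of training points is tolerated.
- Large 'C': smaller margin, misclassified training points are penalized heavily, so the model tries to classify every training point correctly and may overfit.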
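The effect of `C` can be observed by fitting the same linear SVM with different values. This is a small sketch on synthetic, partly overlapping clusters; the cluster centers, spread, and the three `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two partly overlapping synthetic clusters (illustrative data only).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

margins, n_support = {}, {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])  # margin width = 2 / ||w||
    n_support[C] = int(clf.n_support_.sum())         # total number of support vectors
    print(f"C={C:<6} margin width={margins[C]:.3f} "
          f"support vectors={n_support[C]} "
          f"train accuracy={clf.score(X, y):.3f}")
```

A smaller `C` should give a wider margin with more points inside it (more support vectors), while a larger `C` narrows the margin to reduce training errors.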

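**Kernel Trick**
- The kernel trick lets SVM handle non-linearly separable data by implicitly mapping it to a higher-dimensional space where separation is possible. The kernel computes inner products in that space directly, so the mapping itself never has to be computed.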
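To make the idea concrete, here is a small sketch for a degree-2 polynomial kernel without a bias term, K(a, b) = (a · b)². The helper names `phi` and `poly_kernel` are hypothetical and only serve the illustration. The kernel value computed in the original 2D space matches the inner product after an explicit mapping to 3D.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (no bias term)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = np.dot(phi(a), phi(b))  # inner product in the 3-D feature space
implicit = poly_kernel(a, b)       # same quantity, computed in the original 2-D space

print("explicit mapping:", explicit)
print("kernel function: ", implicit)
```

Both numbers agree, which is why an SVM only needs the kernel function and never the mapped vectors.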

**Soft Margin vs Hard Margin**
- Hard margin SVM requires strict separation of the classes with no misclassified training points, so it may not be suitable for noisy data
- Soft margin SVM allows for some misclassification and is more robust to noisy data

**Multi-Class Classification**
- SVM can be extended for multi-class classification using techniques like one-vs-one or one-vs-all
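The two strategies differ in how many binary classifiers they train. The sketch below uses scikit-learn's `OneVsOneClassifier` and `OneVsRestClassifier` wrappers on a synthetic 4-class problem (the data is made up). Note that `SVC` on its own already handles multi-class input, using a one-vs-one scheme internally.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 4-class problem (illustrative data only).
X, y = make_blobs(n_samples=300, centers=4, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One-vs-one trains one classifier per pair of classes: k * (k - 1) / 2.
print("one-vs-one classifiers: ", len(ovo.estimators_))
# One-vs-rest trains one classifier per class: k.
print("one-vs-rest classifiers:", len(ovr.estimators_))
```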

**Regression with SVM**
- SVM can also be used for regression tasks, known as Support Vector Regression (SVR)
- In SVR, the goal is to fit a function that keeps as many data points as possible within a specified margin (the epsilon tube) around it
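A minimal SVR sketch, assuming a noise-free sine curve as toy data and arbitrary values for `C` and `epsilon`. It counts how many training points fall inside the epsilon tube around the fitted function.

```python
import numpy as np
from sklearn.svm import SVR

# Toy regression data: one period of a sine wave (illustrative only).
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel()

# epsilon sets the half-width of the tube in which errors are not penalized.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
pred = svr.predict(X)

inside_tube = np.abs(pred - y) <= svr.epsilon + 1e-6
r2 = svr.score(X, y)

print(f"points inside the epsilon tube: {inside_tube.sum()} of {len(y)}")
print(f"R^2 on the training data: {r2:.3f}")
```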

In [28]:
import pandas as pd

In [56]:
df=pd.read_csv(r"C:\Users\PHANEENDRA\Downloads\Cranes ML\CreditCardFraud.csv",index_col=0)
df

ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,29,3,40,92697,1,1.9,3,0,0,0,0,1,0
4997,30,4,15,92037,4,0.4,1,85,0,0,0,1,0
4998,63,39,24,93023,2,0.3,3,0,0,0,0,0,0
4999,65,40,49,90034,3,0.5,2,0,0,0,0,1,0


In [36]:
df['ZIP Code'].nunique()

467

In [38]:
df['ZIP Code'].dtype

dtype('int64')

In [42]:
df['ZIP Code']=df['ZIP Code'].astype('object')

In [44]:
df['ZIP Code'].dtype

dtype('O')

In [48]:
df['ZIP Code'].nunique()

467

In [50]:
df['ZIP Code'].value_counts()

ZIP Code
94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94087      1
91024      1
9307       1
94598      1
Name: count, Length: 467, dtype: int64

In [52]:
df.isnull().sum()

Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5000 entries, 1 to 5000
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIP Code            5000 non-null   object 
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal Loan       5000 non-null   int64  
 9   Securities Account  5000 non-null   int64  
 10  CD Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(11), object(1)
memory usage: 546.9+ KB


In [58]:
x=df.drop(columns='CreditCard')
y=df['CreditCard']

In [62]:
# Encode the ZIP Code column as integer labels (one code per distinct ZIP)
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
x['ZIP Code']= label_encoder.fit_transform(x['ZIP Code'])
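To see what this encoding does, here is a small sketch on the five ZIP codes from the first rows of the table above. `LabelEncoder` assigns integers in sorted order of the distinct values, so the resulting codes reflect only the numeric ordering of the ZIP codes. Whether that ordering is meaningful for a distance-based model such as SVM is a modelling judgment that this notebook does not examine.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The first five ZIP codes shown in the DataFrame output above.
zips = pd.Series([91107, 90089, 94720, 94112, 91330], name="ZIP Code")

encoder = LabelEncoder()
codes = encoder.fit_transform(zips)

print("original:", list(zips))
print("encoded: ", list(codes))
print("classes: ", list(encoder.classes_))  # sorted distinct values
```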

In [66]:
x['ZIP Code'].nunique()

467

In [68]:
x

ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online
1,25,1,49,83,4,1.6,1,0,0,1,0,0
2,45,19,34,34,3,1.5,1,0,0,1,0,0
3,39,15,11,367,1,1.0,1,0,0,0,0,0
4,35,9,100,298,1,2.7,2,0,0,0,0,0
5,35,8,45,96,4,1.0,2,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
4996,29,3,40,209,1,1.9,3,0,0,0,0,1
4997,30,4,15,141,4,0.4,1,85,0,0,0,1
4998,63,39,24,235,2,0.3,3,0,0,0,0,0
4999,65,40,49,15,3,0.5,2,0,0,0,0,1


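* Scaling is applied before fitting an SVM even when the column values are not large, because SVM relies on distances between points and features on a larger scale would otherwise dominate.
* Standard scaling (zero mean, unit variance) is the usual choice for SVM.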
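One common way to organise this step is to put the scaler and the model in a single `Pipeline`, so that the scaling statistics are learned from the training split only. The sketch below shows that pattern on synthetic data from `make_classification`; it is an alternative arrangement for illustration, not the procedure used in the cells that follow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with one feature on a much larger scale (illustrative only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# The scaler is fitted inside the pipeline, on the training data only.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print("scaler means match training means:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))
print("test accuracy:", model.score(X_test, y_test))
```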

In [79]:
#Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_scaled=sc.fit_transform(x)

In [83]:
x_scaled=pd.DataFrame(x_scaled,columns=x.columns)
x_scaled.head()

,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online
0,-1.774417,-1.666078,-0.538229,-1.197395,1.397414,-0.193385,-1.049078,-0.555524,-0.325875,2.928915,-0.25354,-1.216618
1,-0.029524,-0.09633,-0.864109,-1.571905,0.525991,-0.250611,-1.049078,-0.555524,-0.325875,2.928915,-0.25354,-1.216618
2,-0.552992,-0.445163,-1.363793,0.973233,-1.216855,-0.536736,-1.049078,-0.555524,-0.325875,-0.341423,-0.25354,-1.216618
3,-0.90197,-0.968413,0.569765,0.445862,-1.216855,0.436091,0.141703,-0.555524,-0.325875,-0.341423,-0.25354,-1.216618
4,-0.90197,-1.055621,-0.62513,-1.098035,1.397414,-0.536736,0.141703,-0.555524,-0.325875,-0.341423,-0.25354,-1.216618


In [97]:
#train test split
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(x_scaled,y,test_size=0.2,random_state=1)

In [99]:
from sklearn.svm import SVC
svc=SVC()

svc.fit(xtrain,ytrain)

In [115]:
svc.score(xtrain,ytrain)

0.74525

In [101]:
svc.score(xtest,ytest)

0.755

In [107]:
svc1=SVC(kernel='linear')
svc1.fit(xtrain,ytrain)

In [109]:
svc1.score(xtest,ytest)

0.751

In [111]:
svc2=SVC(kernel='poly')
svc2.fit(xtrain,ytrain)

In [113]:
svc2.score(xtest,ytest)

0.755

**Inference**
- Here the default kernel='rbf' gives the best test accuracy so far (0.755); kernel='poly' gives the same accuracy, and kernel='linear' is slightly lower (0.751)

In [124]:
ypred=svc.predict(xtest)

In [126]:
ypred

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,

In [122]:
from sklearn.metrics import confusion_matrix,classification_report

In [128]:
confusion_matrix(ytest,ypred)

array([[708,   7],
       [238,  47]], dtype=int64)

In [132]:
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.75      0.99      0.85       715
           1       0.87      0.16      0.28       285

    accuracy                           0.76      1000
   macro avg       0.81      0.58      0.56      1000
weighted avg       0.78      0.76      0.69      1000
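The figures in the report can be recomputed by hand from the confusion matrix above. The sketch below does that arithmetic and also computes the accuracy of always predicting class 0, which gives context for the 0.755 score. Only the four counts from the matrix are used.

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted class).
cm = np.array([[708, 7],
               [238, 47]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the predicted 1s, how many are correct
recall_1 = tp / (tp + fn)      # of the true 1s, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
majority_baseline = cm[0].sum() / cm.sum()  # accuracy of always predicting class 0

print(f"accuracy          = {accuracy:.3f}")
print(f"precision (1)     = {precision_1:.3f}")
print(f"recall (1)        = {recall_1:.3f}")
print(f"f1-score (1)      = {f1_1:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

The model finds only 47 of the 285 true class 1 cases, so its accuracy is just a few points above the majority-class baseline.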



**Logistic Regression**

In [143]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()


In [145]:
lr.fit(xtrain,ytrain)

In [147]:
lr.score(xtrain,ytrain)

0.739

In [149]:
lr.score(xtest,ytest)

0.751

**KNN**

In [152]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()

In [154]:
knn.fit(xtrain,ytrain)

In [156]:
knn.score(xtrain,ytrain)

0.77875

In [158]:
knn.score(xtest,ytest)

0.698

**Naive Bayes**

In [167]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
nb.fit(xtrain,ytrain)

In [169]:
nb.score(xtrain,ytrain)

0.73925

In [171]:
nb.score(xtest,ytest)

0.751

**Decision Tree**

In [176]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(xtrain,ytrain)

In [178]:
dt.score(xtrain,ytrain)

1.0

In [180]:
dt.score(xtest,ytest)

0.592

**Random Forest Classifier**

In [185]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(xtrain,ytrain)

In [187]:
rfc.score(xtrain,ytrain)

1.0

In [189]:
rfc.score(xtest,ytest)

0.737

# Inference
* Of all the classifiers, SVC has the best testing accuracy (0.755), although its training accuracy (0.745) is not the highest.
* The Decision Tree and Random Forest classifiers reach a training accuracy of 1.0 but a much lower testing accuracy (0.592 and 0.737), which indicates overfitting.
* Logistic Regression and Naive Bayes are on par with SVM (0.751 testing accuracy each), while KNN struggles on the test set (0.698).
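The comparison is easier to read as a table. The sketch below collects the training and testing accuracies printed in the cells above into a DataFrame and adds the train-test gap as a simple overfitting indicator. The numbers are copied from the outputs above, and the row labels are chosen here for readability.

```python
import pandas as pd

# Accuracies copied from the score() outputs earlier in this notebook.
scores = pd.DataFrame(
    {
        "train": [0.74525, 0.739, 0.77875, 0.73925, 1.0, 1.0],
        "test": [0.755, 0.751, 0.698, 0.751, 0.592, 0.737],
    },
    index=["SVC (rbf)", "LogisticRegression", "KNN",
           "GaussianNB", "DecisionTree", "RandomForest"],
)

# A large positive gap means the model fits the training data much better
# than unseen data.
scores["gap"] = scores["train"] - scores["test"]

print(scores.sort_values("test", ascending=False))
```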