# Python Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, precision_score


## Getting the Features from the dataset

In [2]:
# Load the dataset into a Pandas DataFrame
df = pd.read_csv('station_data_dataverse.csv')

# Print all of the features in the dataset
print(df.columns)


Index(['sessionId', 'kwhTotal', 'dollars', 'created', 'ended', 'startTime',
       'endTime', 'chargeTimeHrs', 'weekday', 'platform', 'distance', 'userId',
       'stationId', 'locationId', 'managerVehicle', 'facilityType', 'Mon',
       'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'reportedZip'],
      dtype='object')


# Adding New feature
A new feature called "FalseDataInjection" is added to the dataset.
## Code Explanation:
The code reads the CSV file named "station_data_dataverse.csv" using the pandas library and stores it in a pandas DataFrame object called "df".

Then, a new column named "FalseDataInjection" is added to the DataFrame by direct column assignment. The values of this new column are generated with numpy.random.randint(2, size=len(df)), which draws integers from 0 up to but not including 2, so every row receives either 0 or 1. The size parameter is set to the number of rows in the DataFrame. Because the values are assigned at random, they are independent of every other column in the dataset.

Finally, the updated DataFrame with the new column is saved to a new CSV file named "dataset_with_FDI.csv" using the to_csv() function from pandas. The index=False parameter ensures that the row index is not included in the output file.

In [3]:
# read in dataset
df = pd.read_csv('station_data_dataverse.csv')

# add FalseDataInjection column with random 0 or 1 values
df['FalseDataInjection'] = np.random.randint(2, size=len(df))

# save new dataset to CSV file
df.to_csv('dataset_with_FDI.csv', index=False)
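
The random labelling step can be made repeatable by seeding the generator. The following is a minimal sketch, not the notebook's own code: the toy DataFrame, its values, and the seed are assumptions chosen for illustration, and it uses NumPy's Generator API instead of np.random.randint.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the charging-session data (values are made up for illustration)
toy = pd.DataFrame({"kwhTotal": [7.8, 9.7, 6.8, 2.6, 6.2, 5.1]})

# A seeded generator produces the same labels on every run
rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary choice

# integers(0, 2) draws from {0, 1}: the upper bound is exclusive
toy["FalseDataInjection"] = rng.integers(0, 2, size=len(toy))

# Show how many rows received each label
print(toy["FalseDataInjection"].value_counts())
```

Without a fixed seed, each run of the notebook produces a different "FalseDataInjection" column, so downstream scores can change between runs.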


## Checking for any null or empty value
The code below works on the DataFrame "df" from the previous step, which already contains the "FalseDataInjection" column.

The isnull() method of the pandas DataFrame checks each cell of "df" for null values. It returns a boolean DataFrame where True indicates a null value and False indicates a non-null value.

Then, the sum() method is applied to this boolean DataFrame. It returns a Series holding the count of null values in each column of the original DataFrame.

Finally, print() displays the count of null values in each column.

## Output Explanation:
This output shows the number of null or missing values in each column of the dataset. The 'distance' column has 1065 missing values, while every other column, such as 'sessionId', 'kwhTotal', 'dollars', and 'created', has none. The 'FalseDataInjection' column also has no missing values, as expected, since every row was assigned a random value of either 0 or 1.



In [4]:
# Check for any null or missing values in the DataFrame
print(df.isnull().sum())

sessionId                0
kwhTotal                 0
dollars                  0
created                  0
ended                    0
startTime                0
endTime                  0
chargeTimeHrs            0
weekday                  0
platform                 0
distance              1065
userId                   0
stationId                0
locationId               0
managerVehicle           0
facilityType             0
Mon                      0
Tues                     0
Wed                      0
Thurs                    0
Fri                      0
Sat                      0
Sun                      0
reportedZip              0
FalseDataInjection       0
dtype: int64
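
As an illustration of how the counts above are produced, here is a small sketch on a made-up DataFrame (the column values are assumptions, not taken from the dataset). It also shows that taking the mean of the same boolean frame gives the share of missing values per column, which can be easier to judge than a raw count.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four 'distance' values are missing
toy = pd.DataFrame({
    "kwhTotal": [7.8, 9.7, 6.8, 2.6],
    "distance": [21.0, np.nan, np.nan, 15.3],
})

# isnull() gives a boolean frame; sum() counts the True values per column
null_counts = toy.isnull().sum()

# mean() of the same boolean frame gives the fraction of missing values per column
null_share = toy.isnull().mean()

print(null_counts)
print(null_share)
```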


# Removing any null or empty cell from the data set:
The code uses the dropna() function from pandas to remove every row that contains a null or missing value. dropna() returns a new DataFrame, which is assigned back to "df". The cleaned data is then saved to "cleaned_dataset.csv".

Since 'distance' is the only column with missing values, this step removes the 1065 rows that lack a distance, so the remaining rows are complete for every feature used in the analysis.

In [None]:
# Drop all rows with null or empty values
df = df.dropna()

# Save the cleaned dataset to a new csv file
df.to_csv('cleaned_dataset.csv', index=False)
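
To see how many rows a call to dropna() discards, the row counts before and after can be compared. This is a minimal sketch on made-up data; the values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with two rows that lack a 'distance' value
toy = pd.DataFrame({
    "kwhTotal": [7.8, 9.7, 6.8, 2.6],
    "distance": [21.0, np.nan, np.nan, 15.3],
})

# dropna() returns a new DataFrame and leaves 'toy' unchanged
cleaned = toy.dropna()

# Compare row counts to see how much data the cleaning step removed
rows_dropped = len(toy) - len(cleaned)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```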

# Converting Text Data into numerical representation using One-hot conversion

## Code Explanation:
In this code, we use the get_dummies() function to perform one-hot encoding on the 'weekday' feature of the cleaned DataFrame "df", which creates a new DataFrame with one binary column per weekday, each prefixed with "weekday_". We add these new columns to the original DataFrame using the concat() function, and then we drop the original 'weekday' column using the drop() function. Finally, we print the first five rows of the updated DataFrame using the head() function.

Note that the dataset already contains day indicator columns ('Mon' through 'Sun'), and these are the columns that the SVM cells below select as features.

After performing one-hot encoding, you can use the numeric features in the SVM model to detect false data injection attacks.

In [None]:
# Perform one-hot encoding on the 'weekday' feature
one_hot = pd.get_dummies(df['weekday'], prefix='weekday')

# Add the new one-hot encoded features to the original dataset
df = pd.concat([df, one_hot], axis=1)

# Drop the original 'weekday' feature since it's no longer needed
df.drop('weekday', axis=1, inplace=True)

# Print the updated dataset
print(df.head())
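
The encoding step can be checked on a small example. The sketch below uses made-up weekday labels and values (assumptions for illustration) and verifies that each row ends up with exactly one active weekday column.

```python
import pandas as pd

# Made-up sessions with a text 'weekday' column
toy = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Mon", "Fri"],
    "kwhTotal": [7.8, 9.7, 6.8, 2.6],
})

# One binary column per distinct weekday value; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(toy["weekday"], prefix="weekday", dtype=int)

# Replace the text column with its one-hot encoded columns
encoded = pd.concat([toy.drop(columns="weekday"), one_hot], axis=1)

print(encoded)
```

Only weekday values that occur in the data get a column, so a dataset with no Sunday sessions would have no "weekday_Sun" column.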

# Apply SVM Model

## What if we change the kernel of SVM? Will the performance of the model change?
Yes, changing the kernel of SVM can impact the performance of the model. The choice of kernel depends on the nature of the data and the problem you are trying to solve.

The first SVM cell below uses the 'rbf' kernel, which stands for radial basis function. This is a commonly used kernel that can work well with data that is not linearly separable. However, other kernels such as linear, polynomial, and sigmoid can also be used depending on the problem at hand.

It's generally a good idea to experiment with different kernels and compare their performance using metrics such as accuracy, precision, recall, and F1-score. This can help you choose the best kernel for your particular problem.
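
One way to run such a comparison is to loop over the kernel names and cross-validate each one on the same data. The sketch below uses a synthetic dataset from make_classification as a stand-in, so its scores say nothing about the charging-session data; the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Same data and same 5-fold split strategy for every kernel
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    kernel_scores[kernel] = (scores.mean(), scores.std())
    print(f"{kernel:8s} accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Keeping the features, target, and number of folds fixed across kernels means any difference in the scores can be attributed to the kernel alone.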

## SVM Model with RBF Kernel & Box plot
## Code Explanation:
The code below implements an SVM (Support Vector Machine) classifier using the Scikit-learn library in Python to predict the 'managerVehicle' target feature of the dataset based on the selected features.

Here's a breakdown of the code:

1. The necessary libraries were imported in the first cell, including pandas for data manipulation, svm for implementing the SVM classifier, and cross_val_score for cross-validation of the model.

2. The selected features for SVM are defined in a list called 'features'.

3. The dataset is split into two parts: the features (X) and the target (y). X contains only the columns named in 'features', and y contains only the 'managerVehicle' target feature.

4. An SVM classifier with the RBF kernel is created.

5. The performance of the model is evaluated using 5-fold cross-validation. The mean accuracy and twice the standard deviation of the cross-validation scores are printed, and the five fold scores are shown in a box plot.

**Note:** The code operates on the DataFrame 'df' left in memory by the cleaning and encoding steps. It does not re-read the cleaned dataset from 'cleaned_dataset.csv'.

## Output Explanation:
The output "Accuracy: 0.64 (+/- 0.18)" means that the mean accuracy of the SVM classifier with RBF kernel is 0.64. The cross-validation is performed using 5 folds, so the accuracy score is the average of the 5 scores obtained during the cross-validation process. The value after "+/-" is twice the standard deviation of those scores, so the standard deviation itself is about 0.09.

The standard deviation provides information about the variability of the accuracy scores obtained from the different folds. A higher standard deviation means that the accuracy scores are more spread out and less reliable, while a lower standard deviation indicates more consistent and reliable results.

In summary, the SVM classifier with RBF kernel has an average accuracy of 0.64, which means that it correctly predicts the value of the "managerVehicle" target variable in 64% of the cases, on average. However, the spread of +/- 0.18 across only 5 folds suggests that the performance of the classifier varies noticeably with the data split.

In [None]:
# Select the features for SVM
features = ['kwhTotal', 'dollars', 'chargeTimeHrs', 'distance', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'FalseDataInjection']

# Split the data into X (features) and y (target)
X = df[features]
y = df['managerVehicle']

# Create SVM classifier with RBF kernel
clf = svm.SVC(kernel='rbf')

# Evaluate performance using 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the mean accuracy and twice the standard deviation of the cross-validation scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

plt.boxplot(scores)
plt.title('Cross-validation scores')
plt.ylabel('Accuracy')
plt.show()
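
The RBF kernel is computed from distances between samples, so features measured on very different scales (for example kWh, dollars, and distance) can dominate one another. A common remedy is to standardize the features inside a pipeline so that the scaler is refitted on each training fold. The sketch below shows the pattern on synthetic data; it is an assumption that scaling is appropriate here, and its effect on the charging-session data has not been checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Inflate one column to mimic features recorded in different units
X_demo[:, 0] *= 1000

# The pipeline fits the scaler on each training fold only, then applies it to the held-out fold
scaled_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_scores = cross_val_score(scaled_clf, X_demo, y_demo, cv=5)

print("Accuracy: %0.2f (+/- %0.2f)" % (scaled_scores.mean(), scaled_scores.std() * 2))
```

Putting the scaler inside the pipeline, rather than scaling the whole dataset up front, keeps information from the held-out fold out of the training step.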


## SVM With Confusion Matrix & With RBF Kernel:
This code is an extension of the previous code that trains an SVM classifier on the dataset and evaluates its performance using cross-validation. It adds a confusion matrix visualization to further analyze the classifier's performance.

After selecting the features for SVM and splitting the data into features (X) and target (y), an SVM classifier with an RBF kernel is created. Then, 5-fold cross-validation is used to evaluate the classifier's performance, and the mean accuracy and twice the standard deviation of the cross-validation scores are printed.

Next, the SVM classifier is trained on the entire dataset, and its predictions for those same rows (y_pred) are obtained. Finally, the confusion matrix is calculated using the true labels (y) and predicted labels (y_pred), and a heatmap visualization of the matrix is plotted using the seaborn library. Because the predictions are made on the training data, this matrix shows how well the model fits the data it has seen, not how it performs on unseen data.

The confusion matrix helps to visualize the performance of the classifier by showing the number of true positives, true negatives, false positives, and false negatives. In scikit-learn's layout, rows correspond to the true labels and columns to the predicted labels. The heatmap visualization makes it easier to interpret the matrix by highlighting the values using different colors.

## Output Explanation:
The cross-validation line of the output is the same as in the previous section, because the same model is evaluated on the same data: "Accuracy: 0.64 (+/- 0.18)". The mean accuracy over the 5 folds is 0.64, and 0.18 is twice the standard deviation of the fold scores, so the standard deviation itself is about 0.09.

On average, the classifier correctly predicts the "managerVehicle" target variable in 64% of the cases, and the spread across folds suggests that its performance varies noticeably with the data split. The heatmap adds a breakdown of the correct and incorrect predictions per class for the rows the classifier was trained on.

In [None]:
# Select the features for SVM
features = ['kwhTotal', 'dollars', 'chargeTimeHrs', 'distance', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'FalseDataInjection']

# Split the data into X (features) and y (target)
X = df[features]
y = df['managerVehicle']

# Create SVM classifier with RBF kernel
clf = svm.SVC(kernel='rbf')

# Evaluate performance using 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the mean accuracy and twice the standard deviation of the cross-validation scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Fit the SVM classifier on the full dataset and predict the same rows
clf.fit(X, y)
y_pred = clf.predict(X)

# Plot the confusion matrix
cm = confusion_matrix(y, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
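
A confusion matrix built from predictions on the training rows tends to look better than the model really is. One alternative, sketched below on synthetic data, is to use cross_val_predict so that every prediction comes from a model that did not see that row during training. The data and variable names are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Each row is predicted by the model trained on the other four folds
y_oof = cross_val_predict(SVC(kernel="rbf"), X_demo, y_demo, cv=5)

# Rows are true labels, columns are predicted labels
cm_oof = confusion_matrix(y_demo, y_oof)

# For a binary problem, ravel() unpacks the four cells in this order
tn, fp, fn, tp = cm_oof.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The resulting matrix can be passed to sns.heatmap in the same way as the one computed in the cell above.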


## SVM With Confusion Matrix & with Linear Kernel:
This code uses an SVM (Support Vector Machine) to classify electric vehicle charging sessions based on the features in the input dataset.

The necessary libraries (pandas, svm, cross_val_score, confusion_matrix, precision_score, and seaborn) were imported in the first cell.

The features used in the SVM model are defined and extracted from the cleaned dataset. Unlike the RBF cells above, this cell includes 'managerVehicle' among the features.

The data is split into two parts: X, which contains the features used for classification, and y, which contains the target variable (in this cell, the binary label 'FalseDataInjection').

An SVM classifier is created using a linear kernel, and its performance is evaluated using 5-fold cross-validation. The mean accuracy and twice the standard deviation of the cross-validation scores are printed to the console.

The SVM classifier is then fitted on the full dataset and used to predict the same rows, and its predictions are stored in the variable y_pred. The precision score is computed from these predictions and printed.

Finally, a confusion matrix is generated using the true labels and the predicted labels, and it is visualized using a heatmap from the seaborn library.

With the linear kernel, the reported accuracy is 0.58 (+/- 0.15), compared to 0.64 (+/- 0.18) with an RBF kernel. Because the code cell below predicts 'FalseDataInjection' while the RBF cells predict 'managerVehicle', the two figures are not a like-for-like comparison of kernels.
## Output Explanation:
The output "Accuracy: 0.58 (+/- 0.15)" means that the average accuracy of the SVM model using a linear kernel is 0.58, and that twice the standard deviation of the accuracy across the cross-validation folds is 0.15.

In other words, the model correctly predicts the target variable approximately 58% of the time on average. However, the accuracy varies across folds, with a standard deviation of about 0.075, which suggests that the performance of the model may not be very consistent.

It's worth noting that the accuracy of the model depends on many factors, including the quality of the data, the choice of features, and the choice of hyperparameters (such as the kernel type). Therefore, it's important to evaluate the model using different kernels and hyperparameters to choose the best model for the given problem.

In [None]:
# Select the features for SVM
features = ['kwhTotal', 'dollars', 'chargeTimeHrs', 'distance', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'managerVehicle']

# Split the data into X (features) and y (target)
X = df[features]
y = df['FalseDataInjection']

# Create SVM classifier with linear kernel
clf = svm.SVC(kernel='linear')

# Evaluate performance using 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the mean accuracy and twice the standard deviation of the cross-validation scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Fit the SVM classifier on the full dataset and predict the same rows
clf.fit(X, y)
y_pred = clf.predict(X)

# Calculate the precision score
precision = precision_score(y, y_pred)

# Print the precision score
print("Precision: %0.2f" % precision)

# Plot the confusion matrix
cm = confusion_matrix(y, y_pred)
sns.heatmap(cm, annot=True, fmt='g')

## SVM with Linear Kernel & checking the Precision Score

In [None]:
# Select the features for SVM
features = ['kwhTotal', 'dollars', 'chargeTimeHrs', 'distance', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'managerVehicle']

# Split the data into X (features) and y (target)
X = df[features]
y = df['FalseDataInjection']

# Create SVM classifier with linear kernel
clf = svm.SVC(kernel='linear')

# Evaluate performance using 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the mean accuracy and twice the standard deviation of the cross-validation scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Fit the SVM classifier on the full dataset and predict the same rows
clf.fit(X, y)
y_pred = clf.predict(X)

# Calculate the precision score
precision = precision_score(y, y_pred)

# Print the precision score
print("Precision: %0.2f" % precision)

# Plot the confusion matrix
cm = confusion_matrix(y, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
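
Precision is the share of predicted positives that are truly positive, TP / (TP + FP). The short sketch below uses hand-made labels (an assumption for illustration) to show that precision_score agrees with the value read off the confusion matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hand-made labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]

# Unpack the binary confusion matrix cells
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision from the matrix cells and from scikit-learn's helper
manual_precision = tp / (tp + fp)
library_precision = precision_score(y_true, y_hat)

print("Precision: %0.2f" % library_precision)  # prints "Precision: 0.75"
```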

## SVM With Confusion Matrix & With Polynomial Kernel:

## Output Explanation:
"Accuracy: 0.58 (+/- 0.11)" is the output of the cross-validation evaluation performed on the SVM model with a polynomial kernel.

The first value, 0.58, is the mean accuracy of the model across the 5 folds of the cross-validation. This means that, on average, the model correctly predicted the 'managerVehicle' label for 58% of the data points.

The second value, +/- 0.11, is twice the standard deviation of the fold accuracy scores, so the standard deviation itself is about 0.055. It is a measure of the variability of the model's performance. The resulting interval runs from 0.47 to 0.69 (0.58 plus or minus 0.11).

Therefore, the output indicates that the model's performance is not very consistent, and the accuracy of the model may vary noticeably depending on the data points used for training and testing.
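
The arithmetic behind the "mean (+/- 2 x standard deviation)" line can be reproduced by hand. The fold scores below are hypothetical values chosen so that the mean is 0.58; they are not the scores produced by the notebook.

```python
import numpy as np

# Hypothetical accuracy scores from 5 cross-validation folds
fold_scores = np.array([0.50, 0.55, 0.60, 0.62, 0.63])

# Mean accuracy across the folds
mean_accuracy = fold_scores.mean()

# The printed "+/-" value is twice the (population) standard deviation
half_width = 2 * fold_scores.std()

# Interval reported as mean plus or minus the half width
low, high = mean_accuracy - half_width, mean_accuracy + half_width

print("Accuracy: %0.2f (+/- %0.2f)" % (mean_accuracy, half_width))
print("Interval: %0.2f to %0.2f" % (low, high))
```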

In [None]:
# Select the features for SVM
features = ['kwhTotal', 'dollars', 'chargeTimeHrs', 'distance', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun', 'FalseDataInjection']

# Split the data into X (features) and y (target)
X = df[features]
y = df['managerVehicle']

# Create SVM classifier with polynomial kernel
clf = svm.SVC(kernel='poly')

# Evaluate performance using 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Print the mean accuracy and twice the standard deviation of the cross-validation scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Fit the SVM classifier on the full dataset and predict the same rows
clf.fit(X, y)
y_pred = clf.predict(X)

# Plot the confusion matrix
cm = confusion_matrix(y, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
