### Anomaly Detection
Objective: Data Preprocessing and Feature Engineering

In [41]:
import pandas as pd
import numpy as np

#### Importing the dataset

In [42]:
df = pd.read_csv(r"C:\Users\dell\Downloads\anomaly-detection\train.csv")
df

Unnamed: 0,timestamp,value,is_anomaly,predicted
0,1425008573,42,False,44.072500
1,1425008873,41,False,50.709390
2,1425009173,41,False,81.405120
3,1425009473,61,False,39.950367
4,1425009773,44,False,35.350160
...,...,...,...,...
15825,1429756073,44,False,53.624115
15826,1429756373,45,False,59.752296
15827,1429756673,48,False,52.147630
15828,1429756973,26,False,58.007545


#### About the data from kaggle
The columns are described as follows:

timestamp [ float ] : is provided as a Unix epoch in seconds.

value [ int ] : is a real-valued measurement of some metric at the timestamp.

is_anomaly [ boolean ] : is a boolean value which is True if the corresponding value is identified as an anomaly.
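
Because `timestamp` is a Unix epoch in seconds, it can be converted to a readable datetime to inspect the sampling interval. The following is a minimal sketch using the first three training rows shown above; the `sample` frame and the `datetime` column name are illustrative and not part of the original notebook.

```python
import pandas as pd

# First three training rows from the output above (illustrative subset).
sample = pd.DataFrame(
    {"timestamp": [1425008573, 1425008873, 1425009173], "value": [42, 41, 41]}
)

# Interpret the integer epoch as seconds since 1970-01-01 UTC.
sample["datetime"] = pd.to_datetime(sample["timestamp"], unit="s")

# Gap between consecutive readings, in seconds.
step_seconds = sample["timestamp"].diff().dropna().astype(int).tolist()

print(sample)
print(step_seconds)
```

In these rows consecutive readings are 300 seconds apart, i.e. one measurement every five minutes.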

#### Checking datatypes

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15830 entries, 0 to 15829
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   timestamp   15830 non-null  int64  
 1   value       15830 non-null  int64  
 2   is_anomaly  15830 non-null  bool   
 3   predicted   15830 non-null  float64
dtypes: bool(1), float64(1), int64(2)
memory usage: 386.6 KB


#### Checking for missing values

In [44]:
df.isna().sum()

timestamp     0
value         0
is_anomaly    0
predicted     0
dtype: int64

#### Data preprocessing and feature engineering
As we can see, the data is in almost perfect condition: there are no missing values, and only one column (is_anomaly) has a non-numerical datatype.

#### Converting the is_anomaly column to numerical format

In [45]:
df['is_anomaly'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 15830 entries, 0 to 15829
Series name: is_anomaly
Non-Null Count  Dtype
--------------  -----
15830 non-null  bool 
dtypes: bool(1)
memory usage: 15.6 KB


In [46]:
df.columns

Index(['timestamp', 'value', 'is_anomaly', 'predicted'], dtype='object')

In [47]:
#for val in df['is_anomaly']:
 #   if val==False:
  #      val=0
   # else:
    #    val=1
#df
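#The above loop has no effect on a pandas DataFrame: assigning to val only rebinds the loop variable and never writes to the column.
#To change the column, assign a transformed Series back to it, for example with the apply method and a lambda function.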

In [48]:
df['is_anomaly'] = df['is_anomaly'].apply(lambda x: 0 if x == False else 1)
df

Unnamed: 0,timestamp,value,is_anomaly,predicted
0,1425008573,42,0,44.072500
1,1425008873,41,0,50.709390
2,1425009173,41,0,81.405120
3,1425009473,61,0,39.950367
4,1425009773,44,0,35.350160
...,...,...,...,...
15825,1429756073,44,0,53.624115
15826,1429756373,45,0,59.752296
15827,1429756673,48,0,52.147630
15828,1429756973,26,0,58.007545
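
The commented-out loop in cell 47 changes nothing because assigning to the loop variable only rebinds a local name. A short sketch on a toy Series (the name `flags` is hypothetical) shows this, along with two equivalent ways to do the conversion: `apply` with a lambda, and the vectorised `astype(int)`.

```python
import pandas as pd

flags = pd.Series([False, True, False], name="is_anomaly")

# Rebinding the loop variable does not modify the Series.
for val in flags:
    val = 0 if val == False else 1
unchanged = flags.tolist()

# Element-wise conversion with apply and a lambda.
via_apply = flags.apply(lambda x: 0 if x == False else 1)

# Vectorised conversion: False -> 0, True -> 1.
as_int = flags.astype(int)

print(unchanged)
print(via_apply.tolist())
print(as_int.tolist())
```

Both conversions return a new Series, so the result has to be assigned back to the column to take effect.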


#### Test data
Now let's take a look at the test data.

In [49]:
df_test = pd.read_csv(r"C:\Users\dell\Downloads\anomaly-detection\test.csv")
df_test

Unnamed: 0,timestamp,value,predicted
0,1396332000,20.00000,20.000000
1,1396332300,20.00000,20.000000
2,1396332600,20.00000,20.000000
3,1396332900,20.00000,20.000000
4,1396333200,20.00000,20.000000
...,...,...,...
3955,1397518500,20.00384,19.836240
3956,1397518800,20.00384,19.207998
3957,1397519100,20.00384,20.103437
3958,1397519400,20.00384,19.346764


#### As we can see, we need to predict the is_anomaly column.
### Let's choose the model and fit it to the training data

Let's note down some points that will help us decide on the model:
* This is a classification problem.
* The data is labelled.
* We know the number of categories in the target variable (2).

### Our model of choice is LinearSVC
LinearSVC is part of the scikit-learn library and performs classification using a linear Support Vector Machine (SVM). It works well for high-dimensional datasets and is a reasonable choice for binary or multi-class classification problems.

In [64]:
#Importing LinearSVC
from sklearn.svm import LinearSVC
model = LinearSVC()

#Splitting the training data into features and target
x = df.drop(columns='is_anomaly')  #independent variables
y = df['is_anomaly']  #target to be predicted

#Fitting the model on the training data
model.fit(x, y)

#Predicting on the test data: df_test plays the role of x_test, result is y_pred
result = model.predict(df_test)
result



array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [51]:
len(result)

3960
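
Linear SVMs are sensitive to feature scale, and the features here mix epoch timestamps of order 10^9 with measurements of order 10^1, so a fit on the raw columns may converge poorly. One common alternative, sketched below under stated assumptions, is to standardise the features inside a pipeline before `LinearSVC`. The synthetic frame, the way its labels are generated, `class_weight="balanced"`, and `random_state=0` are all assumptions for illustration, not settings from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for train.csv: same column names, made-up values.
train = pd.DataFrame(
    {
        "timestamp": 1425008573 + 300 * np.arange(n),
        "value": rng.integers(20, 60, size=n),
        "predicted": rng.normal(45.0, 10.0, size=n),
    }
)
# Label the largest deviations from the forecast as anomalies (assumption).
residual = (train["value"] - train["predicted"]).abs()
train["is_anomaly"] = (residual > residual.quantile(0.95)).astype(int)

X = train.drop(columns="is_anomaly")
y = train["is_anomaly"]

# Scale each feature to zero mean and unit variance, then fit the linear SVM.
clf = make_pipeline(
    StandardScaler(),
    LinearSVC(class_weight="balanced", random_state=0),
)
clf.fit(X, y)

pred = clf.predict(X)
print(pd.Series(pred).value_counts())
```

Putting the scaler inside the pipeline means the same scaling learned on the training data is applied automatically when `clf.predict` is called on test data.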

#### Just to be clear, let's define x_train, y_train, x_test, y_test, and y_pred
* x_train and y_train are passed to the model for fitting: x_train contains the independent variables and y_train contains the dependent/target variable.
* x_test and y_test are the two parts of the test dataset. y_test is usually released only towards the end of a competition, to check whether our predictions for the test set match the real test values (y_test).
* y_pred is the model's prediction for x_test.
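
When y_test is not yet available, one common way to get an interim estimate is to hold out part of the labelled training data as a validation set. This is a minimal sketch on synthetic data with an 80/20 split; the names `X_val` and `y_val`, the 10% anomaly rate, and the split ratio are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and labels: 20 anomalies among 200 rows (made-up data).
X = pd.DataFrame(
    {
        "value": rng.integers(20, 60, size=200),
        "predicted": rng.normal(45.0, 10.0, size=200),
    }
)
y = pd.Series(
    np.r_[np.ones(20, dtype=int), np.zeros(180, dtype=int)], name="is_anomaly"
)

# stratify=y keeps the anomaly rate the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), len(X_val))
print(int(y_train.sum()), int(y_val.sum()))
```

The model is then fitted on `X_train`, `y_train` and scored on `X_val`, `y_val`. For time-ordered data such as this, a chronological split (earlier rows for training, later rows for validation) may be a better fit than a shuffled one.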

### Model Evaluation
According to Kaggle, the evaluation metric for this project is the Mean F-Score, so let's evaluate our model with this metric.

In [65]:
#Let's import y_test
y_test = pd.read_csv(r"C:\Users\dell\Downloads\anomaly-detection\Submission.csv")

#Keeping the first 3960 rows so that the length matches len(result)
ytest = y_test[:3960]
ytest

Unnamed: 0,timestamp,is_anomaly
0,1425008573,False
1,1425008873,False
2,1425009173,False
3,1425009473,False
4,1425009773,False
...,...,...
3955,1426195073,True
3956,1426195373,True
3957,1426195673,True
3958,1426195973,True


#### Before checking the score of the model, we need our submission/test labels in numerical form.

In [69]:
#assign() returns a new DataFrame, so the slice taken earlier is not modified in place and no SettingWithCopyWarning is raised
ytest = ytest.assign(is_anomaly=ytest['is_anomaly'].astype(int))
ytest


Unnamed: 0,timestamp,is_anomaly
0,1425008573,0
1,1425008873,0
2,1425009173,0
3,1425009473,0
4,1425009773,0
...,...,...
3955,1426195073,1
3956,1426195373,1
3957,1426195673,1
3958,1426195973,1


In [70]:
from sklearn.metrics import f1_score
# Calculate the mean F1-score (Macro F1-score)
mean_f1_macro = f1_score(ytest['is_anomaly'], result, average='macro')

# Calculate the mean F1-score (Micro F1-score)
mean_f1_micro = f1_score(ytest['is_anomaly'], result, average='micro')

# Print the results
print(f"Macro F1-Score: {mean_f1_macro:.2f}")
print(f"Micro F1-Score: {mean_f1_micro:.2f}")

Macro F1-Score: 0.49
Micro F1-Score: 0.96
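
A large gap between the two averages is what a classifier produces when it predicts the majority class almost everywhere. For single-label classification, micro-averaged F1 equals accuracy, while macro-averaged F1 is the unweighted mean of the per-class scores, so a class that is never predicted contributes 0 and pulls the average down. The sketch below uses made-up labels with a 4% anomaly rate (an assumption, not the real label distribution) and an all-zeros prediction.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: 4 anomalies among 100 rows.
y_true = np.array([1] * 4 + [0] * 96)

# Baseline that never predicts an anomaly.
y_pred = np.zeros(100, dtype=int)

# zero_division=0 scores the never-predicted class as 0 without a warning.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)

print(per_class)
print(f"Macro F1-Score: {macro:.2f}")
print(f"Micro F1-Score: {micro:.2f}")
```

On this toy input the scores round to 0.49 (macro) and 0.96 (micro), so it may be worth counting how many 1s `result` actually contains before reading the micro score as a sign of good anomaly detection.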


In [None]:
Agnee Mohanty, Registration number = 2141013185