## PART A

### Importing necessary libraries

In [5]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Loading the dataset

In [7]:
data = pd.read_csv(r'C:\Users\Varshan\Fall2023\practical_labs\datasets\Lab_2\Lab2_dataset.csv')

### Preprocessing

In [11]:
X = data['text']
y = data['label']

### Convert text to vectors using CountVectorizer

In [12]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

### Splitting the dataset into training and testing sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Training and evaluating the models
### Support Vector Classifier (SVC)

In [14]:
svc_model = SVC()
svc_model.fit(X_train, y_train)
svc_predictions = svc_model.predict(X_test)

### Gaussian Naive Bayes


In [15]:
gnb_model = GaussianNB()
# GaussianNB does not accept sparse input, so convert the count matrices to dense arrays
gnb_model.fit(X_train.toarray(), y_train)
gnb_predictions = gnb_model.predict(X_test.toarray())

### Multinomial Naive Bayes

In [16]:

mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)
mnb_predictions = mnb_model.predict(X_test)

### Evaluation metrics

In [17]:

def evaluate_model(predictions, model_name):
    """Print accuracy and class-weighted precision, recall and F1 against y_test."""
    # average='weighted' averages the per-class scores, weighted by class support
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average='weighted')
    recall = recall_score(y_test, predictions, average='weighted')
    f1 = f1_score(y_test, predictions, average='weighted')

    print(f"{model_name} Model Evaluation:")
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1-score: {f1:.2f}")

evaluate_model(svc_predictions, "SVC")
evaluate_model(gnb_predictions, "Gaussian Naive Bayes")
evaluate_model(mnb_predictions, "Multinomial Naive Bayes")


SVC Model Evaluation:
Accuracy: 0.97
Precision: 0.97
Recall: 0.97
F1-score: 0.97
Gaussian Naive Bayes Model Evaluation:
Accuracy: 0.95
Precision: 0.95
Recall: 0.95
F1-score: 0.95
Multinomial Naive Bayes Model Evaluation:
Accuracy: 0.98
Precision: 0.98
Recall: 0.98
F1-score: 0.98
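
Accuracy and recall are identical for every model in the output above. This is expected: with `average='weighted'`, each class's recall is weighted by its share of the true labels, so the weighted sum reduces to the overall fraction of correct predictions. The sketch below checks this on made-up labels (`toy_true` and `toy_pred` are hypothetical, not taken from the dataset).

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true and predicted labels for three classes
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 0]

acc = accuracy_score(toy_true, toy_pred)
weighted_recall = recall_score(toy_true, toy_pred, average='weighted')

# Both values are 4/6: four of the six predictions are correct
print(acc, weighted_recall)
```

Recall therefore adds no information beyond accuracy in this setup. Precision and F1 are the metrics that can differ from it.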


## REASONS

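The three models score within a few points of each other (accuracy 0.95 to 0.98), and the differences are consistent with the assumptions each model makes. Multinomial Naive Bayes models features as word counts, which is exactly what CountVectorizer produces, and it scored highest (0.98). Gaussian Naive Bayes assumes each feature is normally distributed within a class, which fits sparse, non-negative counts poorly and may explain its lowest score (0.95). SVC does not make the Naive Bayes independence assumption and, with its default RBF kernel, can learn non-linear decision boundaries, which may explain its strong result (0.97).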

## PART B

### Importing necessary libraries

In [24]:
import pandas as pd
import numpy as np
from scipy import stats

### Loading the dataset

In [26]:
data = pd.read_csv(r'C:\Users\Varshan\Fall2023\practical_labs\datasets\Lab_2\AB_NYC_2019.csv')

### Z-score approach for outlier removal

In [27]:
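# Absolute z-score of each price (distance from the mean in standard deviations)
z_scores = np.abs(stats.zscore(data['price']))
threshold = 3
# Keep only rows whose price lies within 3 standard deviations of the mean
data_without_outliers_z = data[z_scores < threshold]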

### Whiskers (IQR) approach for outlier removal

In [28]:
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
data_without_outliers_whiskers = data[(data['price'] >= lower_whisker) & (data['price'] <= upper_whisker)]

### Compare the cleaned datasets

In [29]:
print("Original dataset shape:", data.shape)
print("Shape after Z-score outlier removal:", data_without_outliers_z.shape)
print("Shape after whiskers outlier removal:", data_without_outliers_whiskers.shape)

Original dataset shape: (48895, 16)
Shape after Z-score outlier removal: (48507, 16)
Shape after whiskers outlier removal: (45923, 16)


## REASONS

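The Z-score method removes only the extreme outliers: with a threshold of 3 it dropped 388 of the 48,895 rows. The whiskers (IQR) method also removes mild outliers and dropped 2,972 rows. One likely reason for the gap is that the mean and standard deviation behind the Z-score are themselves inflated by very high prices, whereas the quartiles are not. The choice of method therefore depends on the analysis: the Z-score is appropriate when only extreme values should be removed, and the whiskers when a tighter price range is needed.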
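
The contrast between the two rules can be reproduced on a small sample. The sketch below uses a hypothetical price list (18 typical values, one mild outlier and one extreme one), not data from the Airbnb file. It computes the z-score with pandas using `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical prices: 18 typical values, one mild outlier (300), one extreme (5000)
prices = pd.Series(list(range(60, 150, 5)) + [300, 5000])

# Z-score rule with the population standard deviation (ddof=0)
z = (prices - prices.mean()).abs() / prices.std(ddof=0)
kept_z = prices[z < 3]

# Whisker rule: keep values within 1.5 * IQR of the quartiles
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
kept_iqr = prices[(prices >= q1 - 1.5 * iqr) & (prices <= q3 + 1.5 * iqr)]

print(len(prices), len(kept_z), len(kept_iqr))
```

On this sample the z-score rule drops only 5000, because that single value inflates the standard deviation enough for 300 to look unremarkable. The whisker rule drops both 300 and 5000, since the quartiles barely move.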