PART A

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score

# Loading the dataset
data = pd.read_csv('/Users/engr/Downloads/Lab2_dataset.csv')

# Preprocessing
# Transforming the "text" feature to a vector representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, data['label'], test_size=0.2, random_state=42)

# Model Training and Evaluation
# Training the SVC model
svc_model = SVC()
svc_model.fit(X_train, y_train)
svc_predictions = svc_model.predict(X_test)
svc_accuracy = accuracy_score(y_test, svc_predictions)

# Training the Gaussian Naive Bayes model
gnb_model = GaussianNB()
gnb_model.fit(X_train.toarray(), y_train)
gnb_predictions = gnb_model.predict(X_test.toarray())
gnb_accuracy = accuracy_score(y_test, gnb_predictions)

# Training the Multinomial Naive Bayes model
mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)
mnb_predictions = mnb_model.predict(X_test)
mnb_accuracy = accuracy_score(y_test, mnb_predictions)

# Comparing the performance of all models
print("SVC Accuracy: {:.2f}".format(svc_accuracy))
print("Gaussian Naive Bayes Accuracy: {:.2f}".format(gnb_accuracy))
print("Multinomial Naive Bayes Accuracy: {:.2f}".format(mnb_accuracy))


SVC Accuracy: 0.97
Gaussian Naive Bayes Accuracy: 0.95
Multinomial Naive Bayes Accuracy: 0.98


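Multinomial Naive Bayes achieved the highest test accuracy (0.98). This is consistent with its design: it models features as non-negative counts, which is exactly what CountVectorizer produces. SVC with default settings (an RBF kernel) came close at 0.97; because training accuracy was not measured, these results cannot show whether it overfit. Gaussian Naive Bayes trailed at 0.95, most plausibly because it models each feature as a normally distributed continuous variable, a poor fit for sparse word counts that are mostly zero. The independence assumption is shared by both Naive Bayes variants, so it does not explain the gap between them. All three figures come from a single 80/20 split, so differences of one to three percentage points should be confirmed with cross-validation before drawing firm conclusions. Overall, the results illustrate that matching a model's assumptions to the characteristics of the data matters for performance.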
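To make the link between word counts and Multinomial Naive Bayes concrete, the following is a from-scratch sketch of the idea, not scikit-learn's implementation. The toy documents, the "spam"/"ham" labels, and the function names are all made up for illustration, and whitespace tokenization is a simplification of what CountVectorizer does.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed per-class word log-probabilities."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())

    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # alpha keeps unseen (class, word) pairs from getting zero probability
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha)
                        / (total_words + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, doc):
    """Return the class with the highest log-posterior score for the document."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words outside the training vocabulary are skipped
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in doc.lower().split() if w in log_likelihood
        )
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from Lab2_dataset.csv
toy_docs = ["win free prize now", "free cash win",
            "meeting at noon", "lunch meeting tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_model = train_multinomial_nb(toy_docs, toy_labels)
prediction = predict(toy_model, "free prize meeting")
print(prediction)
```

Each word contributes to the score once per occurrence, so the raw counts drive the decision directly. Gaussian Naive Bayes would instead fit a mean and variance to each count column.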

PART B

In [2]:
import pandas as pd
from scipy.stats import zscore

# Loading the dataset
data = pd.read_csv('/Users/engr/Downloads/AB_NYC_2019.csv')

# Z-score approach: remove outliers in either tail (|z| >= 3)
z_scores = zscore(data['price'])
z_threshold = 3
data_zscore = data[abs(z_scores) < z_threshold]

# Whiskers approach (IQR): Remove outliers
Q1 = data['price'].quantile(0.25)
Q3 = data['price'].quantile(0.75)
IQR = Q3 - Q1
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
data_whiskers = data[(data['price'] >= lower_whisker) & (data['price'] <= upper_whisker)]

# Keeping whichever result retains more rows
# (a simple heuristic; more rows does not mean cleaner data)
if data_zscore.shape[0] > data_whiskers.shape[0]:
    clean_data = data_zscore
    method_used = "Z-score approach"
else:
    clean_data = data_whiskers
    method_used = "Whiskers approach"

# Check the shape of the final cleaned dataset
print("Shape of the final cleaned dataset using", method_used, ":", clean_data.shape)


Shape of the final cleaned dataset using Z-score approach : (48507, 16)


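The Z-score approach left 48,507 rows and 16 columns. It was selected only because the code keeps whichever result retains more rows, which also means the whiskers (IQR) filter removed more listings here. The Z-score method flags points more than three standard deviations from the mean, which works best for approximately normal data. If price is right-skewed, as listing prices typically are, extreme values inflate both the mean and the standard deviation, so the Z-score filter can miss moderately large outliers. The whiskers method relies on quartiles, which extreme values barely affect, so it makes no normality assumption and often flags more points in a long tail. Retaining more rows is therefore not evidence that the Z-score method cleaned the data better; the appropriate filter depends on the distribution of price and on how the cleaned data will be used.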
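The difference between the two filters can be seen on a small made-up sample with a long right tail. This is a sketch using only the standard library; the sample values and function names are hypothetical and are not taken from AB_NYC_2019.csv. The population standard deviation is used to mirror the default of scipy.stats.zscore, and the "inclusive" quantile method interpolates linearly between data points.

```python
import statistics


def zscore_keep(values, threshold=3.0):
    """Keep values whose absolute Z-score is below the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0)
    return [v for v in values if abs((v - mean) / std) < threshold]


def iqr_keep(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


# Hypothetical right-skewed sample: 40, 50, ..., 200 plus three large values
sample = list(range(40, 201, 10)) + [300, 400, 5000]

z_kept = zscore_keep(sample)
iqr_kept = iqr_keep(sample)

print("Z-score keeps:", len(z_kept))   # 19: only 5000 is removed
print("IQR keeps:", len(iqr_kept))     # 18: 400 and 5000 are removed
```

The single extreme value pulls the mean and standard deviation upward, so 400 ends up well within three standard deviations and survives the Z-score filter, while the quartile-based fence still excludes it.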