# Velon Murugathas _Lab 2_

## Part A

### Preprocessing

In [56]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

### Loading the dataset

In [57]:
df = pd.read_csv("C:/Users/User/Desktop/Fall2023/datasets/Lab2_dataset.csv")                                       # Loading the dataset
print(df.head())

   Unnamed: 0 label                                               text   
0         605   ham  Subject: enron methanol ; meter # : 988291\nth...  \
1        2349   ham  Subject: hpl nom for january 9 , 2001\n( see a...   
2        3624   ham  Subject: neon retreat\nho ho ho , we ' re arou...   
3        4685  spam  Subject: photoshop , windows , office . cheap ...   
4        2030   ham  Subject: re : indian springs\nthis deal is to ...   

   label_num  
0          0  
1          0  
2          0  
3          1  
4          0  


### Use the CountVectorizer function in sklearn to transform the "text" feature to a vector representation of a predetermined size

In [58]:
vectorizer = CountVectorizer()                                                                               # Using CountVectorizer to convert text to a bag-of-words representation

X = vectorizer.fit_transform(df['text'])                                                                     

### Split the dataset into training and testing

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)          # Splitting the dataset for training and testing

## Model Training and Evaluation

### Train the Sklearn SVC model on the training dataset and evaluate on the test set

In [60]:

# Train Sklearn SVC model

svc_model = SVC()                                           # Create an instance of the Support Vector Classifier (SVC)
svc_model.fit(X_train, y_train)                             # Training the SVC model using the training data (X_train and y_train)
svc_predictions = svc_model.predict(X_test)                 # Make predictions

svc_accuracy = accuracy_score(y_test, svc_predictions)

# Evaluating the model on test dataset
      
print("SVC Accuracy:", svc_accuracy)
print("SVC Classification Report:")
print(classification_report(y_test, svc_predictions))

SVC Accuracy: 0.9652173913043478
SVC Classification Report:
              precision    recall  f1-score   support

         ham       0.98      0.97      0.98       742
        spam       0.93      0.95      0.94       293

    accuracy                           0.97      1035
   macro avg       0.95      0.96      0.96      1035
weighted avg       0.97      0.97      0.97      1035



### Train and evaluate also on the Gaussian and Multinomial Naive Bayes Classifiers

In [61]:
# Training Gaussian Naive Bayes model

gaussian_nb_model = GaussianNB()                                            # Creating an instance of the Gaussian Naive Bayes classifier
gaussian_nb_model.fit(X_train.toarray(), y_train)                           # Fit the model to the training data (GaussianNB needs a dense array)
gaussian_nb_predictions = gaussian_nb_model.predict(X_test.toarray()) 


gaussian_nb_accuracy = accuracy_score(y_test, gaussian_nb_predictions)
print("Gaussian Naive Bayes Accuracy:", gaussian_nb_accuracy)
print("Gaussian Naive Bayes Classification Report:")
print(classification_report(y_test, gaussian_nb_predictions))

# Training Multinomial Naive Bayes model

multinomial_nb_model = MultinomialNB()                                     # Creating an instance of the Multinomial Naive Bayes classifier
multinomial_nb_model.fit(X_train, y_train)                                 # Fit the model to the training data
multinomial_nb_predictions = multinomial_nb_model.predict(X_test)

multinomial_nb_accuracy = accuracy_score(y_test, multinomial_nb_predictions)
print("Multinomial Naive Bayes Accuracy:", multinomial_nb_accuracy)
print("Multinomial Naive Bayes Classification Report:")
print(classification_report(y_test, multinomial_nb_predictions))

Gaussian Naive Bayes Accuracy: 0.9545893719806763
Gaussian Naive Bayes Classification Report:
              precision    recall  f1-score   support

         ham       0.95      0.99      0.97       742
        spam       0.96      0.87      0.92       293

    accuracy                           0.95      1035
   macro avg       0.96      0.93      0.94      1035
weighted avg       0.95      0.95      0.95      1035

Multinomial Naive Bayes Accuracy: 0.978743961352657
Multinomial Naive Bayes Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       742
        spam       0.96      0.96      0.96       293

    accuracy                           0.98      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.98      0.98      1035



### Compare between the performance of all models and comment on the reasons behind the differences seen between the three models.

* Multinomial Naive Bayes achieved the highest accuracy (about 97.87%), followed by the Support Vector Classifier (SVC) at about 96.52% and Gaussian Naive Bayes at about 95.46%. All three models show high precision and recall for both the "ham" (non-spam) and "spam" classes. The largest gap is in spam recall, where Gaussian Naive Bayes reaches only 0.87, compared with 0.95 for the SVC and 0.96 for Multinomial Naive Bayes.
* Multinomial Naive Bayes models word counts directly, which matches the bag-of-words features produced by CountVectorizer and likely explains its lead. Gaussian Naive Bayes assumes each feature is normally distributed within a class, which fits sparse, non-negative integer counts poorly. The SVC does not rely on the feature-independence assumption of the Naive Bayes models, but it was trained with default hyperparameters (no tuning was done), and its default RBF kernel may not be the best fit for high-dimensional sparse counts, so tuning or a linear kernel might narrow the gap.
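The figures printed in the cells above can be ranked side by side. The snippet below is a small sketch that copies the reported accuracy and spam-recall values into a DataFrame; the column names and the `results` and `ranked` variables are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Values copied from the printed outputs above (test set of 1035 emails).
results = pd.DataFrame(
    {
        "model": ["SVC", "GaussianNB", "MultinomialNB"],
        "accuracy": [0.9652173913043478, 0.9545893719806763, 0.978743961352657],
        "spam_recall": [0.95, 0.87, 0.96],
    }
)

# Sort from best to worst accuracy to make the ordering explicit.
ranked = results.sort_values("accuracy", ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, "model"]

print(ranked)
print("Best model by accuracy:", best_model)
```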

## Part B

### Remove outliers based on price per night for a given apartment/home.

In [62]:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv("C:/Users/User/Desktop/Fall2023/datasets/AB_NYC_2019.csv")                                                       # Loading the datasets

price_column = "price"


### Compare the Z-score approach and the whiskers approach in terms of which is better at removing the outliers in this case.

In [63]:
z_scores = np.abs(stats.zscore(df[price_column]))                                                   # Calculating Z-scores for the price column to identify outliers using the Z-score approach

threshold = 3                                                                                       # Set a threshold for identifying outliers based on Z-scores (threshold corresponds to approximately three standard deviations away from the mean)

outliers_zscore = df[z_scores > threshold]                                                          # Creating a DataFrame outliers_zscore containing rows with Z-scores greater than the threshold

df_clean_zscore = df[z_scores <= threshold]                                                         # Creating a DataFrame df_clean_zscore containing rows without outliers Z-score approach

Q1 = df[price_column].quantile(0.25)                                                                # Calculating the first quartile for price

Q3 = df[price_column].quantile(0.75)                                                                # Calculating the third quartile for price

IQR = Q3 - Q1                                                                                       # Calculating the interquartile range

lower_bound = Q1 - 1.5 * IQR                                                                        # Define the lower and upper bounds used to identify outliers with the whiskers (IQR) approach
upper_bound = Q3 + 1.5 * IQR
outliers_whiskers = df[(df[price_column] < lower_bound) | (df[price_column] > upper_bound)]
df_clean_whiskers = df[(df[price_column] >= lower_bound) & (df[price_column] <= upper_bound)]

print("Number of outliers (Z-score approach):", len(outliers_zscore))
print("Number of outliers (Whiskers approach):", len(outliers_whiskers))
print("Size of cleaned dataset (Z-score approach):", len(df_clean_zscore))
print("Size of cleaned dataset (Whiskers approach):", len(df_clean_whiskers))

Number of outliers (Z-score approach): 388
Number of outliers (Whiskers approach): 2972
Size of cleaned dataset (Z-score approach): 48507
Size of cleaned dataset (Whiskers approach): 45923


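### Compare the Z-score approach and the whiskers approach in terms of which is better at removing the outliers in this case.
In this case the whiskers (IQR) approach is likely the better choice. The Z-score approach flagged only 388 listings, while the whiskers approach flagged 2972. The Z-score depends on the mean and standard deviation, which are themselves inflated by extreme prices, so with a threshold of 3 it catches only the most extreme values and works best when the data is roughly normal. The whiskers approach is based on quartiles, which are robust to extreme values, so it suits a skewed variable such as price per night better, at the cost of removing more rows (45923 remain instead of 48507).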
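The difference between the two rules can be reproduced on a tiny example. The sketch below uses a made-up list of prices (an assumption for illustration, not the AB_NYC_2019 data) with one very large value and one moderately large value. It applies the same threshold of 3 and the same 1.5 * IQR bounds as the cell above, and computes the Z-scores by hand with the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical prices: mostly ordinary values, one moderately high, one extreme.
toy_prices = np.array(
    [40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 150, 180, 220, 400, 10000],
    dtype=float,
)

# Z-score rule: distance from the mean in units of the standard deviation.
toy_z = np.abs((toy_prices - toy_prices.mean()) / toy_prices.std())
toy_outliers_z = toy_prices[toy_z > 3]

# Whiskers (IQR) rule: anything beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(toy_prices, [25, 75])
iqr = q3 - q1
toy_lower, toy_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
toy_outliers_iqr = toy_prices[(toy_prices < toy_lower) | (toy_prices > toy_upper)]

print("Z-score outliers:", toy_outliers_z)
print("Whiskers outliers:", toy_outliers_iqr)
```

Because the single extreme value inflates both the mean and the standard deviation, the moderately high price stays well under the Z-score threshold, while the quartile-based bounds are barely affected by it.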