## Lab 2 - Preprocessing and Loading the Dataset:
#### Name : Sam Hussain Hajanajumudeen
#### Student # : 8901770

### Importing the necessary libraries:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from scipy import stats

### Load Lab2_dataset file into a pandas DataFrame:

In [2]:
# Forward slash avoids the invalid "\L" escape sequence and works on every OS
spam_mail_df = pd.read_csv('CSVs/Lab2_dataset.csv')
spam_mail_df.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\nth...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\n( see a...",0
2,3624,ham,"Subject: neon retreat\nho ho ho , we ' re arou...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\nthis deal is to ...,0


### Use the CountVectorizer class from sklearn to transform the "text" feature into a vector representation of predetermined size

In [3]:
X = spam_mail_df['text']
y = spam_mail_df['label']
vectorizer = CountVectorizer(max_features=1000)
X_vectorized = vectorizer.fit_transform(X)

### Split the dataset into training and testing

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

### Model Training and Evaluation
* Train the Sklearn SVC model on the training dataset and evaluate on the test set

In [5]:
# Initialize the SVC model
svm_model = SVC()

# Train the model on the training dataset
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Generate a classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Accuracy: 0.95
Classification Report:
               precision    recall  f1-score   support

         ham       0.98      0.95      0.96       742
        spam       0.88      0.96      0.92       293

    accuracy                           0.95      1035
   macro avg       0.93      0.95      0.94      1035
weighted avg       0.95      0.95      0.95      1035
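
The summary rows of the report can be re-derived from the per-class rows. The check below is a minimal sketch that uses the rounded values printed above, so in general small rounding differences against the report are possible.

```python
# Rounded per-class values copied from the SVC report above
ham_precision, spam_precision = 0.98, 0.88
spam_recall = 0.96
ham_support, spam_support = 742, 293

# F1 is the harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

# Macro average: plain mean over classes
macro_precision = (ham_precision + spam_precision) / 2

# Weighted average: mean over classes weighted by support
weighted_precision = (
    ham_precision * ham_support + spam_precision * spam_support
) / (ham_support + spam_support)

print(round(spam_f1, 2), round(macro_precision, 2), round(weighted_precision, 2))
```

Because ham makes up 742 of the 1035 test emails, the weighted average sits closer to the ham precision than the macro average does.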



* Also train and evaluate the Gaussian and Multinomial Naive Bayes classifiers

In [6]:
# Initialize the Gaussian Naive Bayes model
gnb_model = GaussianNB()

# Train the Gaussian Naive Bayes model on the training dataset
gnb_model.fit(X_train.toarray(), y_train)  # Note: Convert X_train to an array for GaussianNB

# Make predictions on the test set
y_pred_gnb = gnb_model.predict(X_test.toarray())  # Note: Convert X_test to an array for GaussianNB

# Calculate accuracy for Gaussian Naive Bayes
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print(f"Gaussian Naive Bayes Accuracy: {accuracy_gnb:.2f}")

# Generate a classification report for Gaussian Naive Bayes
report_gnb = classification_report(y_test, y_pred_gnb)
print("Gaussian Naive Bayes Classification Report:\n", report_gnb)

# Initialize the Multinomial Naive Bayes model
mnb_model = MultinomialNB()

# Train the Multinomial Naive Bayes model on the training dataset
mnb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_mnb = mnb_model.predict(X_test)

# Calculate accuracy for Multinomial Naive Bayes
accuracy_mnb = accuracy_score(y_test, y_pred_mnb)
print(f"Multinomial Naive Bayes Accuracy: {accuracy_mnb:.2f}")

# Generate a classification report for Multinomial Naive Bayes
report_mnb = classification_report(y_test, y_pred_mnb)
print("Multinomial Naive Bayes Classification Report:\n", report_mnb)

Gaussian Naive Bayes Accuracy: 0.94
Gaussian Naive Bayes Classification Report:
               precision    recall  f1-score   support

         ham       0.99      0.93      0.96       742
        spam       0.85      0.97      0.90       293

    accuracy                           0.94      1035
   macro avg       0.92      0.95      0.93      1035
weighted avg       0.95      0.94      0.94      1035

Multinomial Naive Bayes Accuracy: 0.93
Multinomial Naive Bayes Classification Report:
               precision    recall  f1-score   support

         ham       0.97      0.94      0.95       742
        spam       0.86      0.92      0.89       293

    accuracy                           0.93      1035
   macro avg       0.91      0.93      0.92      1035
weighted avg       0.94      0.93      0.93      1035



### Compare the performance of all models and comment on the reasons behind the differences seen between the three models.

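1. Support Vector Classifier (SVC):
* Assumptions: SVC makes no assumption about the probability distribution of the features. It looks for the maximum-margin boundary between the classes, and with the default RBF kernel used here (`SVC()`) that boundary can be non-linear.
* Flexibility: Because the boundary is shaped by the support vectors rather than by a fixed distributional form, SVC can adapt to complex relationships in the data. This is consistent with it reaching the highest accuracy of the three models (0.95) and the best spam F1-score (0.92).

2. Gaussian Naive Bayes (GNB):
* Assumptions: GNB assumes that features are conditionally independent given the class and that each feature follows a Gaussian (normal) distribution within each class. It is a parametric model with a specific distributional assumption.
* Suitability: GNB works well when the Gaussian assumption matches the data. Word counts are non-negative integers and mostly zero, so the assumption is only approximate here. GNB still reached 0.94 accuracy, with the highest spam recall (0.97) but the lowest spam precision (0.85), meaning it flags more ham emails as spam.

3. Multinomial Naive Bayes (MNB):
* Assumptions: MNB assumes that the feature counts of each class follow a multinomial distribution, which is why it is commonly used with word counts such as the `CountVectorizer` output. Like GNB, it also assumes conditional independence between features.
* Suitability: MNB is suited to text classification, where the multinomial assumption is reasonable, and may not perform as well on non-count data. On this dataset it scored 0.93 accuracy, slightly below the other two models, so a well-matched distributional assumption did not by itself give the best result. The three accuracies differ by only 0.02 on a single train/test split, so the ranking should be read with caution.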
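
To make the comparison easier to read, the metrics printed in the three classification reports can be placed side by side. The sketch below copies the rounded values from the outputs above into a DataFrame. The `reported` and `best_per_metric` names are only illustrative.

```python
import pandas as pd

# Rounded values copied from the classification reports above
reported = pd.DataFrame(
    {
        "accuracy": [0.95, 0.94, 0.93],
        "spam_precision": [0.88, 0.85, 0.86],
        "spam_recall": [0.96, 0.97, 0.92],
        "spam_f1": [0.92, 0.90, 0.89],
    },
    index=["SVC", "GaussianNB", "MultinomialNB"],
)

# Model with the highest value for each metric
best_per_metric = reported.idxmax()

# Gap between the best and worst accuracy
accuracy_spread = round(reported["accuracy"].max() - reported["accuracy"].min(), 2)

print(reported)
print(best_per_metric)
print(accuracy_spread)
```

SVC leads on every metric except spam recall, where GaussianNB is ahead by 0.01.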

### PART.B

In [7]:
# Load the dataset
# Forward slash avoids the invalid "\A" escape sequence and works on every OS
df = pd.read_csv('CSVs/AB_NYC_2019.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


* Remove outliers based on price per night for a given apartment/home.

In [8]:

# Define a function to remove outliers based on IQR
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

# Remove outliers based on price per night
df_cleaned = remove_outliers_iqr(df, 'price')

# Save the cleaned dataset to a new file
df_cleaned.to_csv("cleaned_AB_NYC_2019.csv", index=False)
df_cleaned.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
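
The first rows are unchanged because none of those prices fall outside the IQR fences. To see the rule actually remove a value, here is a small worked example on made-up prices (the `toy_prices` values are hypothetical and not taken from `AB_NYC_2019.csv`):

```python
import pandas as pd

# Hypothetical nightly prices with one extreme value
toy_prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 1000])

q1 = toy_prices.quantile(0.25)   # 67.5
q3 = toy_prices.quantile(0.75)   # 102.5
iqr = q3 - q1                    # 35.0

# Fences at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr     # 15.0
upper_bound = q3 + 1.5 * iqr     # 155.0

kept_prices = toy_prices[(toy_prices >= lower_bound) & (toy_prices <= upper_bound)]
print(lower_bound, upper_bound)
print(kept_prices.tolist())
```

Only the value 1000 lies above the upper fence of 155, so it is the single row dropped.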


* Z-Score Approach:

In [9]:
# Calculate Z-scores for the "price" column
# (scipy's zscore uses the population standard deviation, ddof=0, by default)
z_scores = stats.zscore(df['price'])

# Rows with |z| >= 3 are treated as outliers
z_score_threshold = 3

# Keep only the rows whose Z-score lies strictly inside (-3, 3)
df_cleaned_z = df[(z_scores < z_score_threshold) & (z_scores > -z_score_threshold)]

# Save the cleaned dataset using the Z-score approach
df_cleaned_z.to_csv("cleaned_AB_NYC_2019_ZScore.csv", index=False)
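
One property of the Z-score rule worth keeping in mind: the mean and standard deviation are themselves pulled by the extreme values, which can hide them. The sketch below shows this on made-up prices (hypothetical values, not from `AB_NYC_2019.csv`). It is a toy illustration only, since in a sample this small no Z-score can reach 3.

```python
import numpy as np
from scipy import stats

# Hypothetical nightly prices with one extreme value
toy_prices = np.array([50, 60, 70, 80, 90, 100, 110, 1000], dtype=float)

# Z-scores with scipy's default population standard deviation (ddof=0)
toy_z = stats.zscore(toy_prices)

# Same rule as in the cell above: keep rows with |z| < 3
z_mask = np.abs(toy_z) < 3
kept_by_z = toy_prices[z_mask]

print(toy_z.round(2))
print(kept_by_z)
```

The value 1000 inflates the standard deviation to about 305, so its own Z-score is only about 2.64 and it is kept.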

### Comparison

In [10]:
# Define a function to calculate summary statistics
def calculate_summary_stats(data):
    num_data_points = len(data)
    min_price = data['price'].min()
    max_price = data['price'].max()
    mean_price = data['price'].mean()
    std_price = data['price'].std()
    return num_data_points, min_price, max_price, mean_price, std_price

# Calculate summary statistics for each dataset
num_data_original, min_price_original, max_price_original, mean_price_original, std_price_original = calculate_summary_stats(df)
num_data_cleaned_z, min_price_cleaned_z, max_price_cleaned_z, mean_price_cleaned_z, std_price_cleaned_z = calculate_summary_stats(df_cleaned_z)
num_data_cleaned_iqr, min_price_cleaned_iqr, max_price_cleaned_iqr, mean_price_cleaned_iqr, std_price_cleaned_iqr = calculate_summary_stats(df_cleaned)

# Display the comparison results
print("Comparison of Results:")
print("Original dataset:")
print(f"Number of data points: {num_data_original}")
print(f"Min price: {min_price_original}, Max price: {max_price_original}")
print(f"Mean price: {mean_price_original}, Std price: {std_price_original}\n")

print("Cleaned dataset using Z-score approach:")
print(f"Number of data points: {num_data_cleaned_z}")
print(f"Min price: {min_price_cleaned_z}, Max price: {max_price_cleaned_z}")
print(f"Mean price: {mean_price_cleaned_z}, Std price: {std_price_cleaned_z}\n")

print("Cleaned dataset using IQR (Whiskers) approach:")
print(f"Number of data points: {num_data_cleaned_iqr}")
print(f"Min price: {min_price_cleaned_iqr}, Max price: {max_price_cleaned_iqr}")
print(f"Mean price: {mean_price_cleaned_iqr}, Std price: {std_price_cleaned_iqr}\n")

Comparison of Results:
Original dataset:
Number of data points: 48895
Min price: 0, Max price: 10000
Mean price: 152.7206871868289, Std price: 240.15416974718758

Cleaned dataset using Z-score approach:
Number of data points: 48507
Min price: 0, Max price: 860
Mean price: 138.74690250891624, Std price: 107.5582327130842

Cleaned dataset using IQR (Whiskers) approach:
Number of data points: 45923
Min price: 0, Max price: 334
Mean price: 119.97031988328288, Std price: 68.15014770332262
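
The figures above can be turned into a few derived numbers that make the two approaches easier to compare. The sketch below uses only the values printed in the output. Note that the printed standard deviation comes from pandas (sample standard deviation), while `stats.zscore` uses the population version, so the cutoff computed here is a close approximation rather than the exact one.

```python
# Values copied from the comparison output above
n_original, n_z, n_iqr = 48895, 48507, 45923
mean_price = 152.7206871868289
std_price = 240.15416974718758

# Rows removed by each approach, in absolute and relative terms
removed_z = n_original - n_z
removed_iqr = n_original - n_iqr
pct_removed_z = 100 * removed_z / n_original
pct_removed_iqr = 100 * removed_iqr / n_original

# Approximate price cutoffs implied by |z| < 3
z_upper_cutoff = mean_price + 3 * std_price
z_lower_cutoff = mean_price - 3 * std_price

print(removed_z, round(pct_removed_z, 2))
print(removed_iqr, round(pct_removed_iqr, 2))
print(round(z_upper_cutoff, 2), round(z_lower_cutoff, 2))
```

The Z-score approach removes 388 rows (about 0.79%) and the IQR approach removes 2972 rows (about 6.08%). The Z-score upper cutoff of roughly 873 is consistent with the maximum remaining price of 860, and the negative lower cutoff explains why listings priced at 0 are kept.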

